STAT515 Final Project

1.Introduction

Depression is one of the most common mental disorders, affecting millions of people around the world. It can have severe consequences for an individual's life, reducing quality of life and increasing the risk of suicide, so public health research needs to understand its causes and risk factors. For these reasons, we chose to focus our project on investigating depression and the factors associated with it.

2.Background

As the prevalence of depression and anxiety disorders rises globally, public health systems face increasing challenges. As shown in the figure, the prevalence of anxiety and depression varies significantly from country to country, suggesting the need to develop and implement effective treatment strategies tailored to each region. Global attention to depression and anxiety disorders is urgently needed, and action must be taken to reduce their health burden.

This visualization is an interactive map (rendered with tmap's view mode): viewers can pan, zoom, and switch the background basemap at will. This flexibility also accommodates colorblind viewers, so that different audiences can access the information in the graphic.

library(leaflet)
library(dplyr)
library(countrycode)
library(raster)
library(rnaturalearth)
library(ggplot2)
library(sf)
library(tmap)
library(tidyverse)

# load data
data <- read.csv("/Users/zijiehe/Desktop/STAT515FINAL/share-who-report-lifetime-anxiety-or-depression.csv")

# Data preprocessing
data <- data[!is.na(data$Code) & data$Code != "",]

# rnaturalearth package
world <- ne_countries(scale = "medium", returnclass = "sf")

# join function application
world <- merge(world, data, by.x = "iso_a3_eh", by.y = "Code")

# color gradient
world$Rate <- as.numeric(world$Rate)

# shiny function
world_sf <- st_as_sf(world)

tmap_mode("view")

tm <- tm_shape(world_sf) +
  tm_polygons("Rate", palette = "-RdYlBu", title = "Rate of Depression") +
  tm_layout(frame = FALSE, title = "Global Rate of Depression")

tm

This image is a world map showing the global prevalence of anxiety and depression. The different colors on the map represent the prevalence of anxiety and depression in different countries, ranging from 0% to 50%. For example, the dark red color represents a prevalence of 40% to 50%, while the light yellow color indicates a prevalence of 0% to 10%. As can be seen from the figure, anxiety and depression are unevenly distributed globally, with some countries having significantly higher prevalence rates than others.

3.Depression Dataset

3.1 Introduction

3.1.1 Basic information

The depression dataset is taken from https://ourworldindata.org/ and includes 1147 individual records and 35 attributes: sex, age, marital status, number of children in the household, household size, years of education, consumption of nondurable goods, value of durable assets, value of cell phone assets, savings assets, total owned land assets, total food consumption, alcohol consumption, tobacco consumption, consumption of medical care, consumption of children's medical care, consumption of education, consumption of social activities, other consumption, nonagricultural income, flow cost of nonagricultural business, total cost, frequency of purchasing full-price food items on a regular basis, how often children buy full-price food, meat food consumption, whether the diet is adequate, frequency of sleep deprivation due to hunger, number of days household members were sick, number of deaths of children under five years old, expenses on education, school attendance rate, investment in durable goods, investment in nondurable goods, and depressed status. The dataset contains no missing values, but it is imbalanced: 953 records have a depressed value of 0, while only 194 have a depressed value of 1.

3.1.2 The meanings of each variable

library(ggplot2)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(corrplot)
## corrplot 0.92 loaded
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:raster':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
data <- read.csv("/Users/zijiehe/Desktop/STAT515FINAL/dep.csv") 
# set the depression as Y
colnames(data)[ncol(data)] <- "depression"
str(data)


# numeric
sapply(data, is.numeric)
cor_matrix <- cor(data, use = "complete.obs")
print(cor_matrix)
variable_names <- c("sex", "age", "marital_status", "children", "household_size", "years_of_edu", 
                    "hh_children", "cons_nondurable", "asset_durable", "asset_phone", "asset_savings", 
                    "asset_land_owned_total", "cons_allfood", "cons_alcohol", "cons_tobacco", "cons_med_total", 
                    "cons_med_children", "cons_ed", "cons_social", "cons_other", "ent_nonag_revenue", 
                    "ent_nonag_flowcost", "ent_total_cost", "fs_adwholed_often", "fs_chwholed_often", 
                    "fs_meat", "fs_enoughtom", "fs_sleephun", "med_sickdays_hhave", "med_u5_deaths", 
                    "ed_expenses", "ed_schoolattend", "durable_investment", "nondurable_investment", "depressed")

variable_descriptions <- c("Sex, may be 1 for male, 0 for female.", "Age.", "Marital status, may be 1 for married, 0 for unmarried.", 
                           "The number of children in the household.", "Household size.", "Years of education.", 
                           "The number of children in the household.", "Consumption of non-durable goods.", "Value of durable assets.", 
                           "The value of cell phone assets.", "Savings assets.", "Total owned land assets.", 
                           "Total food consumption.", "Alcohol consumption.", "Tobacco consumption.", 
                           "Consumption of medical care.", "Consumption of children's medical care.", "Consumption of education.", 
                           "Consumption of social activities.", "Other consumption.", "Nonagricultural income.", 
                           "Flow cost of non-agricultural business.", "Total cost.", "Frequency of purchasing full-price food items on a regular basis.", 
                           "How often children buy full-price food.", "Meat food consumption.", "Whether the diet is adequate.", 
                           "Frequency of sleep deprivation due to hunger.", "Number of days household members were sick.", 
                           "The number of deaths of children under five years old.", "Expenses on education.", "School attendance rate.", 
                           "Investment in durable goods.", "Investment in nondurable goods.", "Depressed status, where 0 is not depressed and 1 is depressed.")

variable_data <- data.frame(Variable = variable_names, Description = variable_descriptions)
library(knitr)
kable(variable_data, format = "html", table.attr = "style='width:100%;'")
Variable Description
sex Sex, may be 1 for male, 0 for female.
age Age.
marital_status Marital status, may be 1 for married, 0 for unmarried.
children The number of children in the household.
household_size Household size.
years_of_edu Years of education.
hh_children The number of children in the household.
cons_nondurable Consumption of non-durable goods.
asset_durable Value of durable assets.
asset_phone The value of cell phone assets.
asset_savings Savings assets.
asset_land_owned_total Total owned land assets.
cons_allfood Total food consumption.
cons_alcohol Alcohol consumption.
cons_tobacco Tobacco consumption.
cons_med_total Consumption of medical care.
cons_med_children Consumption of children’s medical care.
cons_ed Consumption of education.
cons_social Consumption of social activities.
cons_other Other consumption.
ent_nonag_revenue Nonagricultural income.
ent_nonag_flowcost Flow cost of non-agricultural business.
ent_total_cost Total cost.
fs_adwholed_often Frequency of purchasing full-price food items on a regular basis.
fs_chwholed_often How often children buy full-price food.
fs_meat Meat food consumption.
fs_enoughtom Whether the diet is adequate.
fs_sleephun Frequency of sleep deprivation due to hunger.
med_sickdays_hhave Number of days household members were sick.
med_u5_deaths The number of deaths of children under five years old.
ed_expenses Expenses on education.
ed_schoolattend School attendance rate.
durable_investment Investment in durable goods.
nondurable_investment Investment in nondurable goods.
depressed Depressed status, where 0 is not depressed and 1 is depressed.

3.2 Explore the data

Before modeling, it is essential to preprocess and prepare the dataset to ensure it is suitable for analysis. As a first step, we check for any missing values in the dataset; fortunately, there are none. Next, we assess whether the dataset is balanced by using a bar plot to visualize the distribution of the target variable. The plot below shows that there are far fewer depressed records than non-depressed ones, so the dataset is imbalanced with respect to depression status.
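The missing-value check mentioned above can be done directly in base R. A minimal sketch on a small illustrative data frame (`toy` here is invented for demonstration, not the project dataset):

```r
# Illustrative toy data frame with one missing value (not the project data)
toy <- data.frame(
  age = c(25, 31, NA, 40),
  depressed = c(0, 1, 0, 0)
)

# Count missing values per column and overall
missing_per_col <- colSums(is.na(toy))
total_missing <- sum(is.na(toy))

print(missing_per_col)  # age: 1, depressed: 0
print(total_missing)    # 1
```

On the project dataset both counts come back zero, which is why no imputation step is needed.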

#######explore the data
library(ggplot2)

data <- read.csv("/Users/zijiehe/Desktop/STAT515FINAL/dep.csv")
depressed_counts <- table(data$depressed)

ggplot(data = as.data.frame(depressed_counts), aes(x = Var1, y = Freq, fill = as.factor(Var1))) +
  geom_bar(stat = "identity") +
  labs(x = "Depression Status", y = "Count", fill = "Depression Status") +
  ggtitle("Distribution of Depression Status in the Dataset")

########distribution(each variable)
library(shiny)

# UI
ui <- fluidPage(
  titlePanel("Depression Data Exploration"),
  sidebarLayout(
    sidebarPanel(
      selectInput("variable", "Choose a variable:", choices = colnames(data)[-length(colnames(data))]),
      radioButtons("plotType", "Select plot type:", choices = c("Bar Plot" = "bar", "Box Plot" = "box"))
    ),
    mainPanel(
      plotOutput("plot")
    )
  )
)

# Server
server <- function(input, output) {
  output$plot <- renderPlot({
    if (input$plotType == "bar") {
      # for categorical
      data_to_plot <- table(data[[input$variable]], data$depressed)
      ggplot(as.data.frame(data_to_plot), aes(x = Var1, y = Freq, fill = factor(Var2))) +
        geom_bar(stat = "identity", position = "dodge") +
        labs(x = input$variable, y = "Count", fill = "Depression Status") +
        ggtitle(paste("Distribution of", input$variable, "by Depression Status"))
    } else if (input$plotType == "box") {
      # for continuous variables
      ggplot(data, aes(x = factor(depressed), y = data[[input$variable]], fill = factor(depressed))) +
        geom_boxplot() +
        labs(x = "Depression Status", y = input$variable) +
        ggtitle(paste("Distribution of", input$variable, "by Depression Status"))
    }
  })
}

# shiny application
shinyApp(ui = ui, server = server)

4. Data preprocessing

4.1 Remove the outliers

remove_outliers <- function(dataset, threshold = 3) {
  # Calculate mean and standard deviation for each attribute
  mean_values <- apply(dataset, 2, mean)
  std_dev_values <- apply(dataset, 2, sd)
  
  # Calculate Z-scores for each observation
  z_scores <- scale(dataset, center = mean_values, scale = std_dev_values)
  
  # Identify outliers
  outlier_indices <- apply(abs(z_scores) > threshold, 1, any)
  
  # Remove outliers from the dataset
  cleaned_dataset <- dataset[!outlier_indices, ]
  
  return(cleaned_dataset)
}

In the preprocessing stage, removing outliers is of significant importance. Numerous techniques are available for this purpose; in our project we opted for the Z-score method. Using each attribute's mean and standard deviation, observations more than three standard deviations from the mean are discarded, which under normality retains about 99.7% of the data.
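As a quick sanity check, the Z-score filter can be applied to synthetic data with one planted outlier; the function is restated here so the sketch is self-contained:

```r
# Restated Z-score filter (mirrors remove_outliers above)
remove_outliers <- function(dataset, threshold = 3) {
  mean_values <- apply(dataset, 2, mean)
  std_dev_values <- apply(dataset, 2, sd)
  z_scores <- scale(dataset, center = mean_values, scale = std_dev_values)
  outlier_indices <- apply(abs(z_scores) > threshold, 1, any)
  dataset[!outlier_indices, , drop = FALSE]
}

# Synthetic data: 100 ordinary points plus one extreme outlier
set.seed(42)
toy <- data.frame(x = c(rnorm(100), 1000))
cleaned <- remove_outliers(toy)

nrow(toy)      # 101
nrow(cleaned)  # 100: the planted outlier is removed
```

Note that the mean and standard deviation are themselves inflated by the outlier, so only clearly extreme points exceed the threshold.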

#######look at the cleaned data

data <- read.csv("/Users/zijiehe/Desktop/515Final/cleaned_dataset.csv")
depressed_counts <- table(data$depressed)

ggplot(data = as.data.frame(depressed_counts), aes(x = Var1, y = Freq, fill = as.factor(Var1))) +
  geom_bar(stat = "identity") +
  labs(x = "Depression Status", y = "Count", fill = "Depression Status") +
  ggtitle("Distribution of Depression Status in the CLEANED Dataset")

4.2 Balancing dataset

The given dataset is imbalanced, which can bias model predictions, especially when one class dominates the other, so balancing it before modeling is essential. Two techniques are applied here: upsampling and downsampling.

4.2.1 Upsampling

In upsampling, instances of the minority class are replicated (or synthetically generated) to increase the number of instances in that class.

upSampling <- function(cleaned_dataset) {
  set.seed(111)
  A <- upSample(x = cleaned_dataset[, -ncol(cleaned_dataset)], y = cleaned_dataset$depressed)
  names(A)[names(A) == "Class"] <- "depressed"
  return(A)
}

4.2.2 Downsampling

In downsampling, instances are randomly removed from the majority class, reducing its size to achieve a more balanced dataset.

DownSampling<-function(cleaned_dataset){
  set.seed(111)
  A<-downSample(x=cleaned_dataset[,-ncol(cleaned_dataset)],
                y=cleaned_dataset$depressed)
  names(A)[names(A) == "Class"] <- "depressed"
  return(A)
}
#######look at the up-sampled data

data <- read.csv("/Users/zijiehe/Desktop/515Final/data_up.csv")
depressed_counts <- table(data$depressed)

ggplot(data = as.data.frame(depressed_counts), aes(x = Var1, y = Freq, fill = as.factor(Var1))) +
  geom_bar(stat = "identity") +
  labs(x = "Depression Status", y = "Count", fill = "Depression Status") +
  ggtitle("Distribution of Depression Status in the Up-Sampled Dataset")

4.3 Feature Selection using PCA

PCA is used in the preprocessing step. Since the dataset has 35 attributes, PCA is applied over the whole dataset to select important features and reduce its dimensionality through linear combinations of the variables. The PCA-transformed dataset is then passed to the modeling stage.

##################################
##    PCA      
#################################
library(dplyr)
library(caret)

data <- read.csv("/Users/zijiehe/Desktop/STAT515FINAL/dep.csv") 

# One-Hot Encoding
model_data <- model.matrix(~ . - 1, data = data %>% dplyr::select(-depressed))

# training set and testing set
set.seed(123)
split <- createDataPartition(data$depressed, p = 0.55, list = FALSE)
train_data <- model_data[split, ]
test_data <- model_data[-split, ]
train_labels <- data$depressed[split]
test_labels <- data$depressed[-split]

# Standardizing data for PCA
train_scaled <- scale(train_data)
test_scaled <- scale(test_data, center = attr(train_scaled, "scaled:center"), scale = attr(train_scaled, "scaled:scale"))

pca <- prcomp(train_scaled, center = TRUE, scale. = TRUE)

# Summary
summary(pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.5159 1.76735 1.62361 1.56583 1.39430 1.34675 1.22116
## Proportion of Variance 0.1862 0.09187 0.07753 0.07211 0.05718 0.05335 0.04386
## Cumulative Proportion  0.1862 0.27804 0.35557 0.42769 0.48487 0.53821 0.58207
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.13455 1.06340 1.03218 1.02295 0.95435 0.94237 0.92951
## Proportion of Variance 0.03786 0.03326 0.03133 0.03078 0.02679 0.02612 0.02541
## Cumulative Proportion  0.61993 0.65319 0.68452 0.71530 0.74209 0.76821 0.79362
##                           PC15   PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.90396 0.8882 0.81877 0.80343 0.78846 0.75410 0.74131
## Proportion of Variance 0.02403 0.0232 0.01972 0.01899 0.01828 0.01673 0.01616
## Cumulative Proportion  0.81765 0.8409 0.86057 0.87956 0.89784 0.91457 0.93073
##                           PC22    PC23    PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.68170 0.64688 0.60510 0.54098 0.48887 0.45130 0.42257
## Proportion of Variance 0.01367 0.01231 0.01077 0.00861 0.00703 0.00599 0.00525
## Cumulative Proportion  0.94440 0.95671 0.96748 0.97608 0.98311 0.98910 0.99436
##                           PC29    PC30    PC31    PC32     PC33      PC34
## Standard deviation     0.35686 0.22364 0.10839 0.05292 0.001526 2.008e-08
## Proportion of Variance 0.00375 0.00147 0.00035 0.00008 0.000000 0.000e+00
## Cumulative Proportion  0.99810 0.99957 0.99992 1.00000 1.000000 1.000e+00
# choose the number of principal components
cumsum(pca$sdev^2 / sum(pca$sdev^2))
##  [1] 0.1861730 0.2780412 0.3555743 0.4276868 0.4848652 0.5382108 0.5820708
##  [8] 0.6199299 0.6531895 0.6845244 0.7153017 0.7420895 0.7682092 0.7936207
## [15] 0.8176543 0.8408564 0.8605737 0.8795590 0.8978435 0.9145692 0.9307322
## [22] 0.9444003 0.9567076 0.9674765 0.9760841 0.9831133 0.9891037 0.9943555
## [29] 0.9981010 0.9995720 0.9999176 0.9999999 1.0000000 1.0000000
k <- 15  

# Extract the principal components of the training and test sets
train_pca <- data.frame(pca$x[, 1:k])
test_pca <- data.frame(test_scaled %*% pca$rotation[, 1:k])

# Fitting logistic regression models to PCA-transformed data
train_pca$depressed <- train_labels
model_pca <- glm(depressed ~ ., data = train_pca, family = binomial)

# prediction
predictions_pca <- predict(model_pca, newdata = test_pca, type = "response")
predicted_classes_pca <- ifelse(predictions_pca > 0.5, "1", "0")

predicted_classes_pca <- as.factor(predicted_classes_pca)
test_labels <- as.factor(test_labels)

level_order <- sort(union(levels(predicted_classes_pca), levels(test_labels)))

predicted_classes_pca <- factor(predicted_classes_pca, levels = level_order)
test_labels <- factor(test_labels, levels = level_order)

# evaluation
confusionMatrix(predicted_classes_pca, test_labels)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 418  92
##          1   5   1
##                                           
##                Accuracy : 0.812           
##                  95% CI : (0.7756, 0.8448)
##     No Information Rate : 0.8198          
##     P-Value [Acc > NIR] : 0.7             
##                                           
##                   Kappa : -0.0017         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.98818         
##             Specificity : 0.01075         
##          Pos Pred Value : 0.81961         
##          Neg Pred Value : 0.16667         
##              Prevalence : 0.81977         
##          Detection Rate : 0.81008         
##    Detection Prevalence : 0.98837         
##       Balanced Accuracy : 0.49947         
##                                           
##        'Positive' Class : 0               
## 

4.3.2 PCA result analysis

1. Explanatory power of PCA: from the principal component analysis results, the cumulative proportion of variance explained by the first 15 principal components is about 81.8%. This means that most of the information is retained by these 15 components, which is a reasonable dimensionality reduction.
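The choice of k can also be automated by picking the smallest number of components whose cumulative explained variance crosses a target threshold. A minimal sketch on synthetic correlated data (the 0.80 threshold and the latent-factor construction are illustrative assumptions, not the project data):

```r
set.seed(1)
# Synthetic data: 10 variables driven by 3 shared latent factors plus noise
n <- 200
latent <- matrix(rnorm(n * 3), n, 3)
X <- latent %*% matrix(rnorm(3 * 10), 3, 10) + matrix(rnorm(n * 10, sd = 0.3), n, 10)

pca <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest k reaching 80% cumulative explained variance
k <- which(cum_var >= 0.80)[1]
k
```

The same one-liner applied to the project's `pca` object would replace the hand-picked `k <- 15`.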

2. Model performance (using the PCA features):

Confusion matrix: the results show that the model predicted almost all test samples as the negative category (non-depressed); only 6 samples were predicted as positive (depressed), and just 1 of those predictions was correct. In fact, there are 93 positive samples in the test set.

Accuracy and specificity: the overall accuracy of the model was 81.2%, but its specificity was only about 1.1%, meaning it failed to correctly identify almost all of the truly depressed samples.

3. Problem analysis:

Data imbalance: the proportion of depressed to non-depressed samples in the dataset is severely imbalanced, so the model is biased towards the majority class, resulting in high overall accuracy but very low recall for the minority class.

Influence of the data: although PCA reduces the dimensionality of the data, the correlation between the original variables may not be high enough for PCA to effectively capture information useful for prediction. In addition, many variables contain a large number of zero values, and these zeros may represent different meanings (e.g., not recorded versus an actual value of zero), which can distort the intrinsic distribution of the data. PCA also tends to emphasize variables with large variance: some variables in this dataset have very high variance (e.g., nonagricultural income) while others have relatively low variance, so the high-variance variables dominate the principal components, and other variables that may be just as important, but have low variance, are largely ignored.

4. Improve the methodology:

Dealing with the imbalanced data: as the visualizations show, the dataset is imbalanced. We are considering using SMOTE or other resampling techniques to balance the categories, since our target variable, depression status, is highly imbalanced.
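SMOTE is available in R packages (e.g., smotefamily), but its core idea, interpolating between a minority point and one of its nearest minority neighbors, can be sketched in base R. This is a simplified illustration on invented data, not the exact published algorithm:

```r
set.seed(7)
# Synthetic minority-class observations (10 rows, 2 numeric features)
minority <- matrix(rnorm(20), ncol = 2)

smote_sketch <- function(X, n_new, k = 3) {
  synth <- matrix(NA_real_, nrow = n_new, ncol = ncol(X))
  for (i in seq_len(n_new)) {
    j <- sample(nrow(X), 1)                                   # pick a minority point
    d <- sqrt(rowSums((X - matrix(X[j, ], nrow(X), ncol(X), byrow = TRUE))^2))
    nb <- order(d)[2:(k + 1)]                                 # its k nearest neighbours
    m <- sample(nb, 1)
    gap <- runif(1)                                           # interpolate between the pair
    synth[i, ] <- X[j, ] + gap * (X[m, ] - X[j, ])
  }
  synth
}

new_points <- smote_sketch(minority, n_new = 30)
dim(new_points)  # 30 synthetic minority samples with 2 features
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's feature range rather than being exact duplicates, which is the advantage over plain upsampling.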

Feature engineering: further analyze the variables, especially those containing a large number of zero values, to understand what these zeros mean and decide whether the variables need special treatment, such as transformation or imputation of missing values.
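One simple treatment for such zero-heavy monetary variables, under the assumption that a zero may mean "not recorded" rather than a true zero, is to add an explicit zero indicator and log-transform the values. A minimal sketch on an invented expenditure vector:

```r
# Illustrative zero-heavy expenditure variable (not the project data)
cons <- c(0, 0, 120, 0, 45, 300, 0, 15)

# Indicator for zero entries, which may mean "not recorded"
is_zero <- as.integer(cons == 0)

# log1p compresses the heavy right tail while keeping zeros at 0
log_cons <- log1p(cons)

data.frame(cons, is_zero, log_cons)
```

The indicator lets a downstream model separate "zero because absent" from "small positive amount", while log1p reduces the dominance of high-variance variables noted in the PCA analysis.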

5.Model applications and Model Comparison

5.1 imbalanced dataset

5.1.1 Logistic Model

# Load necessary libraries
library(readxl)
library(caret)
library(dplyr)

# Read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Remove outliers based on Z-score
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ]
}

# Logistic regression analysis on cleaned dataset
perform_logistic_regression <- function(cleaned_dataset) {
  # Setting up data for training and testing
  set.seed(123)
  training_indices <- createDataPartition(cleaned_dataset$depressed, p = 0.7, list = FALSE)
  train_data <- cleaned_dataset[training_indices, ]
  test_data <- cleaned_dataset[-training_indices, ]
  
  # Training the logistic regression model
  logistic_model <- glm(depressed ~ ., data = train_data, family = binomial())
  summary(logistic_model)
  
  # Predicting on the test data
  predictions <- predict(logistic_model, newdata = test_data, type = "response")
  predicted_classes <- factor(ifelse(predictions > 0.5, 1, 0), levels = c(0, 1))
  
  # Evaluating the model
  confusion_matrix <- confusionMatrix(predicted_classes, test_data$depressed)
  print(confusion_matrix)
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
perform_logistic_regression(data)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 204  36
##          1   2   1
##                                           
##                Accuracy : 0.8436          
##                  95% CI : (0.7917, 0.8869)
##     No Information Rate : 0.8477          
##     P-Value [Acc > NIR] : 0.6129          
##                                           
##                   Kappa : 0.0278          
##                                           
##  Mcnemar's Test P-Value : 8.636e-08       
##                                           
##             Sensitivity : 0.99029         
##             Specificity : 0.02703         
##          Pos Pred Value : 0.85000         
##          Neg Pred Value : 0.33333         
##              Prevalence : 0.84774         
##          Detection Rate : 0.83951         
##    Detection Prevalence : 0.98765         
##       Balanced Accuracy : 0.50866         
##                                           
##        'Positive' Class : 0               
## 

5.1.1 Summary of Logistic model on imbalanced dataset

This logistic regression model achieves a seemingly good accuracy (around 84.36%) but performs poorly at distinguishing between the two categories (specifically the depressed category, i.e., 1). The accuracy is close to the No Information Rate (NIR), which indicates that the model does not significantly outperform random guessing. In addition, the model almost never detects the depressed class: its specificity is nearly zero.

Accuracy: the accuracy is 84.36%, but this is largely due to the imbalance in the data (most records belong to class 0).

Positive Predictive Value and Negative Predictive Value: PPV is 85%, indicating that when the model predicts the positive class (0), it is correct 85% of the time. However, the NPV is only 33.33%, so the model's predictions of class '1' are unreliable; it cannot distinguish class '1'.

Sensitivity and Specificity: sensitivity is as high as 99%, but specificity is only 2.70%. This indicates that the model hardly ever recognizes the depressed class (class = 1) correctly.

Kappa statistic: the Kappa value is only 0.0278, which indicates that the model has almost no predictive power beyond chance.

Mcnemar's Test P-value: 8.636e-08, indicating that the model's errors are strongly asymmetric between the two classes.

Balanced Accuracy: 50.87%, which further emphasizes the inadequacy of the model in handling unbalanced datasets.
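The balanced accuracy reported by caret can be reproduced directly from the confusion matrix counts above:

```r
# Confusion matrix counts from the output above (caret's positive class = 0)
TP <- 204; FN <- 2    # class 0: correctly / incorrectly classified
TN <- 1;  FP <- 36    # class 1: correctly / incorrectly classified

sensitivity <- TP / (TP + FN)   # 204 / 206 = 0.99029
specificity <- TN / (TN + FP)   # 1 / 37   = 0.02703
balanced_accuracy <- (sensitivity + specificity) / 2

round(balanced_accuracy, 5)  # 0.50866, matching the caret output
```

Because balanced accuracy averages the per-class recalls, it exposes the near-total failure on class 1 that raw accuracy hides.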

5.1.2 Weighted Logistic model

# Load necessary libraries
library(caret)
library(dplyr)

# Function to read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Function to remove outliers based on Z-score
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ]
}

# Function for weighted logistic regression analysis
perform_weighted_logistic_regression <- function(cleaned_dataset, weight) {
  # Setting up data for training and testing
  set.seed(123)
  training_indices <- createDataPartition(cleaned_dataset$depressed, p = 0.8, list = FALSE)
  train_data <- cleaned_dataset[training_indices, ]
  test_data <- cleaned_dataset[-training_indices, ]

  # Defining weights for the logistic regression model
  weights <- ifelse(train_data$depressed == 0, 1, weight)

  # Training the weighted logistic regression model
  weighted_model <- glm(depressed ~ ., data = train_data, family = binomial(), weights = weights)
  summary(weighted_model)
  
  # Predicting on the test data
  predictions <- predict(weighted_model, newdata = test_data, type = "response")
  predicted_classes <- factor(ifelse(predictions > 0.5, 1, 0), levels = c(0, 1))
  
  # Evaluating the model
  confusion_matrix <- confusionMatrix(predicted_classes, test_data$depressed)
  print(confusion_matrix)
  
  # Additional model evaluation
  aic <- AIC(weighted_model)
  print(paste("AIC:", aic))
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
perform_weighted_logistic_regression(data, weight = 4)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 108  12
##          1  29  13
##                                           
##                Accuracy : 0.7469          
##                  95% CI : (0.6727, 0.8119)
##     No Information Rate : 0.8457          
##     P-Value [Acc > NIR] : 0.99961         
##                                           
##                   Kappa : 0.2413          
##                                           
##  Mcnemar's Test P-Value : 0.01246         
##                                           
##             Sensitivity : 0.7883          
##             Specificity : 0.5200          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.3095          
##              Prevalence : 0.8457          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 0.7407          
##       Balanced Accuracy : 0.6542          
##                                           
##        'Positive' Class : 0               
##                                           
## [1] "AIC: 1238.59469568419"

5.1.2 Summary of Weighted Logistic model on imbalanced dataset

Accuracy: 74.69%, which is lower than that of the unweighted model.

Predictive Ability: the model has high prediction accuracy for the positive class (PPV of 90%) but performs poorly for the negative class (NPV is only 30.95%), indicating that it is still unreliable when predicting the depressed class. This echoes the previous result and is again caused by the imbalanced data.

Balanced Accuracy: 0.6542, a clear improvement over the unweighted model, though performance on the imbalanced dataset is still limited, especially for the depressed class (class = 1).

Mcnemar's Test P-Value: 0.01246, indicating that the model still shows noticeable bias between the two classes.

Kappa: 0.2413, much better than the unweighted model, because this model gives extra weight to class '1'.
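The weight of 4 used above is close to what an inverse-class-frequency heuristic would suggest. Using the class counts from the dataset description (953 non-depressed, 194 depressed), one common choice is:

```r
# Class counts from the dataset description (0 = not depressed, 1 = depressed)
n_class <- c("0" = 953, "1" = 194)

# Inverse-frequency weight for the minority class
minority_weight <- unname(n_class["0"] / n_class["1"])
round(minority_weight, 2)  # 4.91, close to the weight of 4 used above
```

Weighting the minority class by roughly the class ratio makes the two classes contribute comparable total weight to the likelihood, which is why Kappa improves relative to the unweighted fit.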

5.1.3 Decision Tree model

# Load necessary libraries
library(readxl)
library(caret)
library(dplyr)
library(rpart)
library(rattle)
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
# Function to read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Function to remove outliers based on Z-score
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ]
}

# Function for decision tree analysis on imbalanced data
perform_decision_tree <- function(cleaned_dataset) {
  # Setting up data for training and testing
  set.seed(123)
  training_indices <- createDataPartition(cleaned_dataset$depressed, p = 0.7, list = FALSE)
  train_data <- cleaned_dataset[training_indices, ]
  test_data <- cleaned_dataset[-training_indices, ]

  # Training the decision tree model
  tree_model <- rpart(depressed ~ ., data = train_data, method = "class")

  # Plotting the decision tree
  fancyRpartPlot(tree_model, main="Decision Tree", sub = NULL)

  # Printing the summary of the trained decision tree
  print(summary(tree_model))

  # Predicting on the test set
  predicted_values <- predict(tree_model, newdata = test_data, type = "class")
  predicted_classes <- factor(predicted_values, levels = c(0, 1))
  actual_values <- factor(test_data$depressed, levels = c(0, 1))
  
  # Evaluating the model
  confusion_matrix <- confusionMatrix(predicted_classes, actual_values)
  print(confusion_matrix)
  
  return(list(model = tree_model, confusion_matrix = confusion_matrix))
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
perform_decision_tree(data)

## Call:
## rpart(formula = depressed ~ ., data = train_data, method = "class")
##   n= 570 
## 
##           CP nsplit rel error   xerror       xstd
## 1 0.03030303      0 1.0000000 1.000000 0.09802678
## 2 0.02272727      3 0.9090909 1.079545 0.10110870
## 3 0.01136364      5 0.8636364 1.181818 0.10478269
## 4 0.01000000      6 0.8522727 1.261364 0.10743548
## 
## Variable importance
##           hh_children          years_of_edu    med_sickdays_hhave 
##                    14                    12                    11 
##           cons_social     fs_adwholed_often            cons_other 
##                    10                    10                     8 
##       cons_nondurable    durable_investment                   age 
##                     5                     5                     4 
##          cons_allfood nondurable_investment         asset_durable 
##                     4                     3                     3 
##     fs_chwholed_often        household_size         asset_savings 
##                     3                     2                     1 
##           fs_sleephun               fs_meat 
##                     1                     1 
## 
## Node number 1: 570 observations,    complexity param=0.03030303
##   predicted class=0  expected loss=0.154386  P(node) =1
##     class counts:   482    88
##    probabilities: 0.846 0.154 
##   left son=2 (477 obs) right son=3 (93 obs)
##   Primary splits:
##       years_of_edu      < 6.5        to the right, improve=4.107167, (0 missing)
##       fs_adwholed_often < 2          to the left,  improve=3.973162, (0 missing)
##       children          < 0.5        to the right, improve=2.285400, (0 missing)
##       age               < 66.5       to the left,  improve=2.021110, (0 missing)
##       asset_phone       < 15.93529   to the right, improve=1.665369, (0 missing)
##   Surrogate splits:
##       age               < 56.5       to the left,  agree=0.870, adj=0.204, (0 split)
##       household_size    < 1.5        to the right, agree=0.840, adj=0.022, (0 split)
##       cons_social       < 24.08979   to the left,  agree=0.840, adj=0.022, (0 split)
##       children          < 0.5        to the right, agree=0.839, adj=0.011, (0 split)
##       cons_med_children < 12.8123    to the left,  agree=0.839, adj=0.011, (0 split)
## 
## Node number 2: 477 observations,    complexity param=0.02272727
##   predicted class=0  expected loss=0.1278826  P(node) =0.8368421
##     class counts:   416    61
##    probabilities: 0.872 0.128 
##   left son=4 (253 obs) right son=5 (224 obs)
##   Primary splits:
##       cons_social        < 0.7273647  to the right, improve=1.8047670, (0 missing)
##       durable_investment < 234.8594   to the right, improve=1.6072560, (0 missing)
##       asset_phone        < 63.26071   to the left,  improve=1.3022130, (0 missing)
##       ed_expenses        < 45.6438    to the left,  improve=0.8842878, (0 missing)
##       fs_adwholed_often  < 2          to the left,  improve=0.8533597, (0 missing)
##   Surrogate splits:
##       cons_nondurable       < 46.84868   to the right, agree=0.878, adj=0.741, (0 split)
##       cons_other            < 3.483343   to the right, agree=0.878, adj=0.741, (0 split)
##       durable_investment    < 36.43497   to the right, agree=0.878, adj=0.741, (0 split)
##       nondurable_investment < 0.02780446 to the right, agree=0.878, adj=0.741, (0 split)
##       asset_durable         < 11.29084   to the right, agree=0.876, adj=0.737, (0 split)
## 
## Node number 3: 93 observations,    complexity param=0.03030303
##   predicted class=0  expected loss=0.2903226  P(node) =0.1631579
##     class counts:    66    27
##    probabilities: 0.710 0.290 
##   left son=6 (69 obs) right son=7 (24 obs)
##   Primary splits:
##       fs_adwholed_often     < 2          to the left,  improve=4.087073, (0 missing)
##       cons_social           < 1.621556   to the left,  improve=3.937965, (0 missing)
##       nondurable_investment < 4.569942   to the left,  improve=2.632348, (0 missing)
##       cons_alcohol          < 0.587257   to the right, improve=2.547581, (0 missing)
##       ent_total_cost        < 4.771246   to the left,  improve=2.439231, (0 missing)
##   Surrogate splits:
##       fs_chwholed_often < 2          to the left,  agree=0.785, adj=0.167, (0 split)
##       fs_sleephun       < 0.5        to the left,  agree=0.774, adj=0.125, (0 split)
##       fs_meat           < 0.5        to the right, agree=0.763, adj=0.083, (0 split)
##       cons_med_total    < 1.040999   to the left,  agree=0.753, adj=0.042, (0 split)
##       cons_ed           < 10.2098    to the left,  agree=0.753, adj=0.042, (0 split)
## 
## Node number 4: 253 observations
##   predicted class=0  expected loss=0.08695652  P(node) =0.4438596
##     class counts:   231    22
##    probabilities: 0.913 0.087 
## 
## Node number 5: 224 observations,    complexity param=0.02272727
##   predicted class=0  expected loss=0.1741071  P(node) =0.3929825
##     class counts:   185    39
##    probabilities: 0.826 0.174 
##   left son=10 (214 obs) right son=11 (10 obs)
##   Primary splits:
##       hh_children       < 4.5        to the left,  improve=5.789736, (0 missing)
##       fs_adwholed_often < 2          to the left,  improve=4.149148, (0 missing)
##       cons_ed           < 1.287903   to the left,  improve=3.827600, (0 missing)
##       ed_expenses       < 15.45483   to the left,  improve=3.535107, (0 missing)
##       cons_med_children < 0.4003842  to the right, improve=2.183618, (0 missing)
##   Surrogate splits:
##       asset_savings      < 28.82767   to the left,  agree=0.96, adj=0.1, (0 split)
##       fs_chwholed_often  < 2          to the left,  agree=0.96, adj=0.1, (0 split)
##       durable_investment < 843.3353   to the left,  agree=0.96, adj=0.1, (0 split)
## 
## Node number 6: 69 observations,    complexity param=0.01136364
##   predicted class=0  expected loss=0.2028986  P(node) =0.1210526
##     class counts:    55    14
##    probabilities: 0.797 0.203 
##   left son=12 (62 obs) right son=13 (7 obs)
##   Primary splits:
##       cons_other      < 34.5932    to the left,  improve=2.1160760, (0 missing)
##       cons_alcohol    < 0.587257   to the right, improve=0.9629084, (0 missing)
##       children        < 1.5        to the left,  improve=0.8641367, (0 missing)
##       years_of_edu    < 5.5        to the right, improve=0.8101365, (0 missing)
##       cons_nondurable < 241.4361   to the left,  improve=0.7934950, (0 missing)
##   Surrogate splits:
##       cons_nondurable < 266.7672   to the left,  agree=0.928, adj=0.286, (0 split)
##       cons_social     < 9.142107   to the left,  agree=0.928, adj=0.286, (0 split)
##       cons_allfood    < 215.6241   to the left,  agree=0.913, adj=0.143, (0 split)
## 
## Node number 7: 24 observations,    complexity param=0.03030303
##   predicted class=1  expected loss=0.4583333  P(node) =0.04210526
##     class counts:    11    13
##    probabilities: 0.458 0.542 
##   left son=14 (14 obs) right son=15 (10 obs)
##   Primary splits:
##       med_sickdays_hhave < 1.525      to the left,  improve=4.402381, (0 missing)
##       cons_social        < 1.354633   to the left,  improve=2.938889, (0 missing)
##       ed_schoolattend    < 0.8571429  to the right, improve=2.937646, (0 missing)
##       years_of_edu       < 4.5        to the left,  improve=2.288095, (0 missing)
##       ent_total_cost     < 4.771246   to the left,  improve=1.399184, (0 missing)
##   Surrogate splits:
##       cons_social    < 3.203074   to the left,  agree=0.750, adj=0.4, (0 split)
##       cons_allfood   < 32.61111   to the right, agree=0.708, adj=0.3, (0 split)
##       age            < 66         to the left,  agree=0.667, adj=0.2, (0 split)
##       household_size < 1.5        to the right, agree=0.667, adj=0.2, (0 split)
##       years_of_edu   < 3.5        to the left,  agree=0.667, adj=0.2, (0 split)
## 
## Node number 10: 214 observations
##   predicted class=0  expected loss=0.1495327  P(node) =0.3754386
##     class counts:   182    32
##    probabilities: 0.850 0.150 
## 
## Node number 11: 10 observations
##   predicted class=1  expected loss=0.3  P(node) =0.01754386
##     class counts:     3     7
##    probabilities: 0.300 0.700 
## 
## Node number 12: 62 observations
##   predicted class=0  expected loss=0.1612903  P(node) =0.1087719
##     class counts:    52    10
##    probabilities: 0.839 0.161 
## 
## Node number 13: 7 observations
##   predicted class=1  expected loss=0.4285714  P(node) =0.0122807
##     class counts:     3     4
##    probabilities: 0.429 0.571 
## 
## Node number 14: 14 observations
##   predicted class=0  expected loss=0.2857143  P(node) =0.0245614
##     class counts:    10     4
##    probabilities: 0.714 0.286 
## 
## Node number 15: 10 observations
##   predicted class=1  expected loss=0.1  P(node) =0.01754386
##     class counts:     1     9
##    probabilities: 0.100 0.900 
## 
## n= 570 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 570 88 0 (0.84561404 0.15438596)  
##    2) years_of_edu>=6.5 477 61 0 (0.87211740 0.12788260)  
##      4) cons_social>=0.7273647 253 22 0 (0.91304348 0.08695652) *
##      5) cons_social< 0.7273647 224 39 0 (0.82589286 0.17410714)  
##       10) hh_children< 4.5 214 32 0 (0.85046729 0.14953271) *
##       11) hh_children>=4.5 10  3 1 (0.30000000 0.70000000) *
##    3) years_of_edu< 6.5 93 27 0 (0.70967742 0.29032258)  
##      6) fs_adwholed_often< 2 69 14 0 (0.79710145 0.20289855)  
##       12) cons_other< 34.5932 62 10 0 (0.83870968 0.16129032) *
##       13) cons_other>=34.5932 7  3 1 (0.42857143 0.57142857) *
##      7) fs_adwholed_often>=2 24 11 1 (0.45833333 0.54166667)  
##       14) med_sickdays_hhave< 1.525 14  4 0 (0.71428571 0.28571429) *
##       15) med_sickdays_hhave>=1.525 10  1 1 (0.10000000 0.90000000) *
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 199  36
##          1   7   1
##                                           
##                Accuracy : 0.823           
##                  95% CI : (0.7691, 0.8689)
##     No Information Rate : 0.8477          
##     P-Value [Acc > NIR] : 0.8759          
##                                           
##                   Kappa : -0.0102         
##                                           
##  Mcnemar's Test P-Value : 1.955e-05       
##                                           
##             Sensitivity : 0.96602         
##             Specificity : 0.02703         
##          Pos Pred Value : 0.84681         
##          Neg Pred Value : 0.12500         
##              Prevalence : 0.84774         
##          Detection Rate : 0.81893         
##    Detection Prevalence : 0.96708         
##       Balanced Accuracy : 0.49652         
##                                           
##        'Positive' Class : 0               
## 
## $model
## n= 570 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 570 88 0 (0.84561404 0.15438596)  
##    2) years_of_edu>=6.5 477 61 0 (0.87211740 0.12788260)  
##      4) cons_social>=0.7273647 253 22 0 (0.91304348 0.08695652) *
##      5) cons_social< 0.7273647 224 39 0 (0.82589286 0.17410714)  
##       10) hh_children< 4.5 214 32 0 (0.85046729 0.14953271) *
##       11) hh_children>=4.5 10  3 1 (0.30000000 0.70000000) *
##    3) years_of_edu< 6.5 93 27 0 (0.70967742 0.29032258)  
##      6) fs_adwholed_often< 2 69 14 0 (0.79710145 0.20289855)  
##       12) cons_other< 34.5932 62 10 0 (0.83870968 0.16129032) *
##       13) cons_other>=34.5932 7  3 1 (0.42857143 0.57142857) *
##      7) fs_adwholed_often>=2 24 11 1 (0.45833333 0.54166667)  
##       14) med_sickdays_hhave< 1.525 14  4 0 (0.71428571 0.28571429) *
##       15) med_sickdays_hhave>=1.525 10  1 1 (0.10000000 0.90000000) *
## 
## $confusion_matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 199  36
##          1   7   1
##                                           
##                Accuracy : 0.823           
##                  95% CI : (0.7691, 0.8689)
##     No Information Rate : 0.8477          
##     P-Value [Acc > NIR] : 0.8759          
##                                           
##                   Kappa : -0.0102         
##                                           
##  Mcnemar's Test P-Value : 1.955e-05       
##                                           
##             Sensitivity : 0.96602         
##             Specificity : 0.02703         
##          Pos Pred Value : 0.84681         
##          Neg Pred Value : 0.12500         
##              Prevalence : 0.84774         
##          Detection Rate : 0.81893         
##    Detection Prevalence : 0.96708         
##       Balanced Accuracy : 0.49652         
##                                           
##        'Positive' Class : 0               
## 

5.1.3 Summary of Decision Tree model on imbalanced dataset

Based on the decision tree model and its confusion matrix, the model predicted mainly class 0 (non-depressed) and performed poorly on class 1 (depressed), similar to several of the models above.

Accuracy: 82.3%. This looks high, but it is below the No Information Rate of 84.77%, so the model is no more accurate than always predicting class 0.

Kappa statistic: Kappa is negative (-0.0102), indicating agreement no better than chance and therefore poor predictive power.

Sensitivity and specificity: sensitivity is high (96.6%), so the model identifies non-depressed individuals well; specificity, however, is extremely low (2.7%), so it almost never correctly identifies individuals with true depressive symptoms.

Positive and negative predictive values: the positive predictive value is 84.68%, but the negative predictive value is only 12.5%, confirming the model's poor ability to predict class '1'.
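These statistics follow directly from the confusion-matrix counts printed above; recomputing them by hand makes the class-1 failure explicit (a sketch using only the four counts from the output):

```r
# Recompute the key statistics from the confusion matrix above,
# where the 'positive' class is 0:
#            Reference
# Prediction   0   1
#          0 199  36
#          1   7   1
tp <- 199; fn <- 7; fp <- 36; tn <- 1
n  <- tp + fn + fp + tn
sens    <- tp / (tp + fn)              # 199/206 ~= 0.966
spec    <- tn / (tn + fp)              # 1/37    ~= 0.027
bal_acc <- (sens + spec) / 2           # ~= 0.4965, worse than a coin flip
po <- (tp + tn) / n                    # observed agreement (accuracy)
pe <- ((tp + fp) / n) * ((tp + fn) / n) +  # chance agreement from the margins
      ((tn + fn) / n) * ((tn + fp) / n)
kappa <- (po - pe) / (1 - pe)          # ~= -0.010, slightly below chance
```

Because the tree predicts class 0 for almost every observation, the observed agreement barely differs from what the class margins alone would produce, which is exactly what the near-zero (negative) Kappa says.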

5.2 Balanced dataset

5.2.1 Logistic model on balanced dataset (Upsampling)

# Load necessary libraries
library(readxl)
library(caret)
library(dplyr)

# Function to read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Function to remove outliers based on Z-score
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ]
}

# Function for upsampling imbalanced data
upSampling <- function(dataset) {
  set.seed(111)
  upsampled_data <- upSample(x = dataset[, -ncol(dataset)], y = dataset$depressed)
  names(upsampled_data)[names(upsampled_data) == "Class"] <- "depressed"
  return(upsampled_data)
}

# Function for logistic regression analysis on balanced dataset
perform_logistic_regression <- function(balanced_dataset) {
  # Setting up data for training and testing
  set.seed(123)
  training_indices <- createDataPartition(balanced_dataset$depressed, p = 0.7, list = FALSE)
  train_data <- balanced_dataset[training_indices, ]
  test_data <- balanced_dataset[-training_indices, ]

  # Training the logistic regression model
  logistic_model <- glm(depressed ~ ., data = train_data, family = binomial())
  summary(logistic_model)
  
  # Predicting on the test data
  predictions <- predict(logistic_model, newdata = test_data, type = "response")
  predicted_classes <- factor(ifelse(predictions > 0.5, 1, 0), levels = c(0, 1))
  
  # Evaluating the model
  confusion_matrix <- confusionMatrix(predicted_classes, test_data$depressed)
  print(confusion_matrix)
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
balanced_data <- upSampling(data)
perform_logistic_regression(balanced_data)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 142  86
##          1  64 120
##                                           
##                Accuracy : 0.6359          
##                  95% CI : (0.5874, 0.6825)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 1.884e-08       
##                                           
##                   Kappa : 0.2718          
##                                           
##  Mcnemar's Test P-Value : 0.08641         
##                                           
##             Sensitivity : 0.6893          
##             Specificity : 0.5825          
##          Pos Pred Value : 0.6228          
##          Neg Pred Value : 0.6522          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3447          
##    Detection Prevalence : 0.5534          
##       Balanced Accuracy : 0.6359          
##                                           
##        'Positive' Class : 0               
## 

5.2.1 Summary of Logistic model on balanced dataset

Summarizing the model performance: the overall accuracy is 63.59%. Although this is lower than the accuracies obtained on the imbalanced data, the model's predictive ability is now far more even across the two classes; sensitivity and specificity are moderate rather than strong.

No Information Rate (NIR): 0.5, indicating balanced data categories.

P-Value [Acc > NIR]: 1.884e-08. This very small p-value indicates that the model is significantly more accurate than the no-information baseline.

Kappa: 0.2718, indicating fair agreement beyond chance and some genuine predictive power.

McNemar's Test P-Value: 0.08641. This value is greater than 0.05, indicating no significant asymmetry between errors on the positive and negative classes.

Sensitivity: 68.93%. This means that the model correctly identifies 68.93% of the actual positive classes, indicating that the model performs well in identifying actual positive classes.

Specificity: 58.25%. This means that the model correctly identifies 58.25% of the actual negative classes, indicating that the model has improved its performance relative to the previous model.

Pos Pred Value, PPV: 62.28%. This is the percentage of predicted positive categories that are actually positive, indicating that when the model predicts a sample to be positive, there is a 62.28% probability that it will be correct.

Neg Pred Value, NPV: 65.22%. This is the percentage of predicted negative categories that are actually negative, indicating that when the model predicts a sample to be negative, there is a 65.22% probability of being correct.

Balanced Accuracy: 63.59%. This is the average of the sensitivity and specificity, and it shows that the model's predictive ability is now roughly even across the two classes.
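The upSample step used above replicates minority-class rows until the two classes are equal in size, which is why the NIR drops to 0.5. The same idea can be sketched in base R on a toy data frame (an illustration only, not a replacement for caret::upSample):

```r
# Toy illustration of upsampling: duplicate minority-class rows
# (sampling with replacement) until both classes have equal counts.
set.seed(111)
toy <- data.frame(x = 1:10, depressed = factor(c(rep(0, 8), rep(1, 2))))
minority <- toy[toy$depressed == 1, ]
extra    <- minority[sample(nrow(minority), 8 - 2, replace = TRUE), ]
balanced <- rbind(toy, extra)
table(balanced$depressed)   # both classes now have 8 rows
```

A caveat worth keeping in mind: because upsampled rows are exact duplicates, performing the train/test split after upsampling can place copies of the same observation in both sets.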

5.2.2 Decision Tree model on balanced dataset (Upsampling)

# Load necessary libraries
library(readxl)
library(caret)
library(dplyr)
library(rpart)
library(rattle)

# Function to read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Function to remove outliers based on Z-score
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ]
}

# Function for upsampling imbalanced data
upSampling <- function(dataset) {
  set.seed(111)
  upsampled_data <- upSample(x = dataset[, -ncol(dataset)], y = dataset$depressed)
  names(upsampled_data)[names(upsampled_data) == "Class"] <- "depressed"
  return(upsampled_data)
}

# Function for decision tree analysis on balanced dataset
perform_decision_tree <- function(balanced_dataset) {
  # Setting up data for training and testing
  set.seed(123)
  training_indices <- createDataPartition(balanced_dataset$depressed, p = 0.7, list = FALSE)
  train_data <- balanced_dataset[training_indices, ]
  test_data <- balanced_dataset[-training_indices, ]

  # Training the decision tree model
  tree_model <- rpart(depressed ~ ., data = train_data, method = "class")

  # Plotting the decision tree
  fancyRpartPlot(tree_model, main="Balanced Decision Tree", sub = NULL)

  # Printing the summary of the trained decision tree
  print(summary(tree_model))

  # Predicting on the test set
  predicted_values <- predict(tree_model, newdata = test_data, type = "class")
  predicted_classes <- factor(predicted_values, levels = c(0, 1))
  actual_values <- factor(test_data$depressed, levels = c(0, 1))
  
  # Evaluating the model
  confusion_matrix <- confusionMatrix(predicted_classes, actual_values)
  print(confusion_matrix)
  
  return(list(model = tree_model, confusion_matrix = confusion_matrix))
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
balanced_data <- upSampling(data)
perform_decision_tree(balanced_data)

## Call:
## rpart(formula = depressed ~ ., data = train_data, method = "class")
##   n= 964 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.15560166      0 1.0000000 1.0788382 0.03210758
## 2  0.10788382      1 0.8443983 0.9107884 0.03207941
## 3  0.03941909      2 0.7365145 0.7842324 0.03144917
## 4  0.01867220      3 0.6970954 0.7406639 0.03110591
## 5  0.01659751      5 0.6597510 0.7385892 0.03108789
## 6  0.01556017      8 0.6058091 0.6867220 0.03058654
## 7  0.01452282     15 0.4854772 0.6867220 0.03058654
## 8  0.01244813     16 0.4709544 0.6473029 0.03013808
## 9  0.01175657     17 0.4585062 0.6307054 0.02993114
## 10 0.01000000     20 0.4232365 0.6078838 0.02962849
## 
## Variable importance
##           cons_social       cons_nondurable            cons_other 
##                    10                     8                     7 
##         asset_durable          cons_allfood     cons_med_children 
##                     6                     6                     6 
##    durable_investment     fs_chwholed_often    med_sickdays_hhave 
##                     5                     5                     5 
##               fs_meat              children           asset_phone 
##                     5                     4                     4 
##     fs_adwholed_often        household_size        ent_total_cost 
##                     4                     4                     3 
##        marital_status                   age nondurable_investment 
##                     3                     3                     3 
##           fs_sleephun          years_of_edu          cons_alcohol 
##                     2                     2                     2 
##        cons_med_total          cons_tobacco         asset_savings 
##                     1                     1                     1 
## 
## Node number 1: 964 observations,    complexity param=0.1556017
##   predicted class=0  expected loss=0.5  P(node) =1
##     class counts:   482   482
##    probabilities: 0.500 0.500 
##   left son=2 (755 obs) right son=3 (209 obs)
##   Primary splits:
##       fs_adwholed_often  < 2          to the left,  improve=17.18210, (0 missing)
##       med_sickdays_hhave < 1.469298   to the left,  improve=15.26727, (0 missing)
##       cons_social        < 3.22643    to the right, improve=14.05538, (0 missing)
##       durable_investment < 282.9833   to the right, improve=12.82277, (0 missing)
##       years_of_edu       < 6.5        to the right, improve=12.75755, (0 missing)
##   Surrogate splits:
##       fs_sleephun       < 0.5        to the left,  agree=0.859, adj=0.349, (0 split)
##       fs_chwholed_often < 0.9384058  to the left,  agree=0.856, adj=0.335, (0 split)
##       cons_social       < 20.57975   to the left,  agree=0.793, adj=0.043, (0 split)
##       cons_med_total    < 29.62843   to the left,  agree=0.789, adj=0.029, (0 split)
##       asset_savings     < 94.49069   to the left,  agree=0.785, adj=0.010, (0 split)
## 
## Node number 2: 755 observations,    complexity param=0.1078838
##   predicted class=0  expected loss=0.4503311  P(node) =0.783195
##     class counts:   415   340
##    probabilities: 0.550 0.450 
##   left son=4 (357 obs) right son=5 (398 obs)
##   Primary splits:
##       cons_social           < 0.08007685 to the right, improve=22.26146, (0 missing)
##       fs_meat               < 3.03444    to the left,  improve=21.27456, (0 missing)
##       nondurable_investment < 0.3447753  to the right, improve=17.67330, (0 missing)
##       ent_total_cost        < 0.07785249 to the right, improve=16.71426, (0 missing)
##       cons_nondurable       < 13.62031   to the right, improve=16.65899, (0 missing)
##   Surrogate splits:
##       asset_durable      < 12.0916    to the right, agree=0.959, adj=0.913, (0 split)
##       durable_investment < 20.90006   to the right, agree=0.958, adj=0.910, (0 split)
##       cons_nondurable    < 13.62031   to the right, agree=0.956, adj=0.908, (0 split)
##       cons_allfood       < 2.616797   to the right, agree=0.956, adj=0.908, (0 split)
##       cons_other         < 0.4804611  to the right, agree=0.955, adj=0.905, (0 split)
## 
## Node number 3: 209 observations,    complexity param=0.01659751
##   predicted class=1  expected loss=0.3205742  P(node) =0.216805
##     class counts:    67   142
##    probabilities: 0.321 0.679 
##   left son=6 (108 obs) right son=7 (101 obs)
##   Primary splits:
##       med_sickdays_hhave < 1.775      to the left,  improve=10.279040, (0 missing)
##       durable_investment < 283.7841   to the right, improve= 8.556608, (0 missing)
##       cons_other         < 44.05828   to the right, improve= 6.367930, (0 missing)
##       ent_total_cost     < 0.6517366  to the left,  improve= 5.469878, (0 missing)
##       years_of_edu       < 8.5        to the right, improve= 4.869325, (0 missing)
##   Surrogate splits:
##       cons_social           < 2.462363   to the left,  agree=0.646, adj=0.267, (0 split)
##       cons_med_total        < 14.01345   to the left,  agree=0.622, adj=0.218, (0 split)
##       cons_med_children     < 0.8808453  to the left,  agree=0.622, adj=0.218, (0 split)
##       marital_status        < 0.5        to the right, agree=0.617, adj=0.208, (0 split)
##       nondurable_investment < 40.25419   to the left,  agree=0.612, adj=0.198, (0 split)
## 
## Node number 4: 357 observations,    complexity param=0.03941909
##   predicted class=0  expected loss=0.3221289  P(node) =0.370332
##     class counts:   242   115
##    probabilities: 0.678 0.322 
##   left son=8 (284 obs) right son=9 (73 obs)
##   Primary splits:
##       asset_phone            < 63.26071   to the left,  improve=17.411140, (0 missing)
##       fs_meat                < 4.5        to the left,  improve=10.299610, (0 missing)
##       asset_land_owned_total < 1.3        to the right, improve= 8.888106, (0 missing)
##       asset_durable          < 523.8868   to the left,  improve= 7.820506, (0 missing)
##       cons_med_total         < 12.33183   to the left,  improve= 6.885786, (0 missing)
##   Surrogate splits:
##       cons_med_children     < 10.57014   to the left,  agree=0.821, adj=0.123, (0 split)
##       asset_durable         < 584.561    to the left,  agree=0.812, adj=0.082, (0 split)
##       nondurable_investment < 0.3447753  to the right, agree=0.812, adj=0.082, (0 split)
##       cons_nondurable       < 423.3387   to the left,  agree=0.810, adj=0.068, (0 split)
##       cons_other            < 77.35424   to the left,  agree=0.810, adj=0.068, (0 split)
## 
## Node number 5: 398 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4346734  P(node) =0.4128631
##     class counts:   173   225
##    probabilities: 0.435 0.565 
##   left son=10 (312 obs) right son=11 (86 obs)
##   Primary splits:
##       fs_chwholed_often  < 0.112069   to the right, improve=8.963033, (0 missing)
##       children           < 0.5        to the right, improve=7.398128, (0 missing)
##       cons_med_children  < 0.6780793  to the right, improve=7.359677, (0 missing)
##       marital_status     < 0.5        to the right, improve=7.297268, (0 missing)
##       med_sickdays_hhave < 3.75       to the left,  improve=5.508631, (0 missing)
##   Surrogate splits:
##       cons_med_children < 0.1501441  to the right, agree=0.987, adj=0.942, (0 split)
##       children          < 0.5        to the right, agree=0.925, adj=0.651, (0 split)
##       household_size    < 2.5        to the right, agree=0.894, adj=0.512, (0 split)
##       cons_nondurable   < 14.34405   to the left,  agree=0.862, adj=0.360, (0 split)
##       asset_durable     < 2.882766   to the left,  agree=0.862, adj=0.360, (0 split)
## 
## Node number 6: 108 observations,    complexity param=0.01659751
##   predicted class=1  expected loss=0.4722222  P(node) =0.1120332
##     class counts:    51    57
##    probabilities: 0.472 0.528 
##   left son=12 (16 obs) right son=13 (92 obs)
##   Primary splits:
##       cons_social    < 4.170669   to the right, improve=10.463770, (0 missing)
##       marital_status < 0.5        to the left,  improve= 8.233333, (0 missing)
##       ent_total_cost < 0.8713918  to the left,  improve= 7.724959, (0 missing)
##       fs_meat        < 3.5        to the right, improve= 6.699595, (0 missing)
##       asset_durable  < 61.01856   to the right, improve= 6.519048, (0 missing)
##   Surrogate splits:
##       cons_med_children  < 2.001921   to the right, agree=0.880, adj=0.188, (0 split)
##       cons_other         < 44.37859   to the right, agree=0.880, adj=0.188, (0 split)
##       med_sickdays_hhave < 1.669643   to the right, agree=0.880, adj=0.188, (0 split)
##       cons_med_total     < 2.322229   to the right, agree=0.870, adj=0.125, (0 split)
##       years_of_edu       < 11         to the right, agree=0.861, adj=0.063, (0 split)
## 
## Node number 7: 101 observations
##   predicted class=1  expected loss=0.1584158  P(node) =0.1047718
##     class counts:    16    85
##    probabilities: 0.158 0.842 
## 
## Node number 8: 284 observations,    complexity param=0.01175657
##   predicted class=0  expected loss=0.2429577  P(node) =0.2946058
##     class counts:   215    69
##    probabilities: 0.757 0.243 
##   left son=16 (117 obs) right son=17 (167 obs)
##   Primary splits:
##       ent_total_cost        < 3.359891   to the left,  improve=8.827632, (0 missing)
##       nondurable_investment < 3.705779   to the left,  improve=8.730911, (0 missing)
##       ent_nonag_revenue     < 333.1197   to the left,  improve=5.414585, (0 missing)
##       ed_expenses           < 15.61499   to the left,  improve=5.241596, (0 missing)
##       years_of_edu          < 5.5        to the right, improve=5.046073, (0 missing)
##   Surrogate splits:
##       nondurable_investment < 7.223599   to the left,  agree=0.866, adj=0.675, (0 split)
##       cons_nondurable       < 129.3255   to the left,  agree=0.683, adj=0.231, (0 split)
##       cons_other            < 13.70115   to the left,  agree=0.662, adj=0.179, (0 split)
##       durable_investment    < 140.7668   to the left,  agree=0.651, adj=0.154, (0 split)
##       cons_allfood          < 85.47331   to the left,  agree=0.648, adj=0.145, (0 split)
## 
## Node number 9: 73 observations,    complexity param=0.0186722
##   predicted class=1  expected loss=0.369863  P(node) =0.07572614
##     class counts:    27    46
##    probabilities: 0.370 0.630 
##   left son=18 (29 obs) right son=19 (44 obs)
##   Primary splits:
##       fs_meat      < 3.5        to the left,  improve=7.833040, (0 missing)
##       cons_social  < 3.269805   to the right, improve=6.299506, (0 missing)
##       fs_sleephun  < 0.5        to the right, improve=4.150348, (0 missing)
##       age          < 29.5       to the right, improve=4.109722, (0 missing)
##       years_of_edu < 9.5        to the left,  improve=3.661097, (0 missing)
##   Surrogate splits:
##       cons_other      < 24.34336   to the left,  agree=0.836, adj=0.586, (0 split)
##       fs_sleephun     < 0.5        to the right, agree=0.767, adj=0.414, (0 split)
##       cons_allfood    < 137.5272   to the left,  agree=0.753, adj=0.379, (0 split)
##       cons_nondurable < 92.18123   to the left,  agree=0.740, adj=0.345, (0 split)
##       asset_savings   < 3.203074   to the left,  agree=0.726, adj=0.310, (0 split)
## 
## Node number 10: 312 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4903846  P(node) =0.3236515
##     class counts:   153   159
##    probabilities: 0.490 0.510 
##   left son=20 (10 obs) right son=21 (302 obs)
##   Primary splits:
##       household_size    < 2.5        to the left,  improve=5.366149, (0 missing)
##       children          < 1.5        to the left,  improve=4.720085, (0 missing)
##       age               < 22.5       to the left,  improve=3.436915, (0 missing)
##       marital_status    < 0.5        to the right, improve=2.763880, (0 missing)
##       cons_med_children < 1.525836   to the right, improve=2.550214, (0 missing)
## 
## Node number 11: 86 observations
##   predicted class=1  expected loss=0.2325581  P(node) =0.08921162
##     class counts:    20    66
##    probabilities: 0.233 0.767 
## 
## Node number 12: 16 observations
##   predicted class=0  expected loss=0  P(node) =0.01659751
##     class counts:    16     0
##    probabilities: 1.000 0.000 
## 
## Node number 13: 92 observations,    complexity param=0.01659751
##   predicted class=1  expected loss=0.3804348  P(node) =0.09543568
##     class counts:    35    57
##    probabilities: 0.380 0.620 
##   left son=26 (10 obs) right son=27 (82 obs)
##   Primary splits:
##       marital_status        < 0.5        to the left,  improve=8.613468, (0 missing)
##       cons_allfood          < 65.76883   to the left,  improve=7.987229, (0 missing)
##       ent_total_cost        < 0.8713918  to the left,  improve=6.309825, (0 missing)
##       age                   < 50         to the right, improve=5.816624, (0 missing)
##       nondurable_investment < 1.782266   to the left,  improve=5.816624, (0 missing)
##   Surrogate splits:
##       age                    < 53         to the right, agree=0.935, adj=0.4, (0 split)
##       asset_land_owned_total < 2.495      to the right, agree=0.913, adj=0.2, (0 split)
## 
## Node number 16: 117 observations
##   predicted class=0  expected loss=0.09401709  P(node) =0.1213693
##     class counts:   106    11
##    probabilities: 0.906 0.094 
## 
## Node number 17: 167 observations,    complexity param=0.01175657
##   predicted class=0  expected loss=0.3473054  P(node) =0.1732365
##     class counts:   109    58
##    probabilities: 0.653 0.347 
##   left son=34 (125 obs) right son=35 (42 obs)
##   Primary splits:
##       years_of_edu       < 7.5        to the right, improve=6.898480, (0 missing)
##       cons_nondurable    < 53.34834   to the right, improve=5.580367, (0 missing)
##       durable_investment < 191.5102   to the right, improve=4.503051, (0 missing)
##       ent_total_cost     < 16.04206   to the right, improve=4.358415, (0 missing)
##       asset_phone        < 41.63996   to the right, improve=4.268866, (0 missing)
##   Surrogate splits:
##       cons_nondurable    < 66.13776   to the right, agree=0.802, adj=0.214, (0 split)
##       cons_allfood       < 38.47273   to the right, agree=0.802, adj=0.214, (0 split)
##       age                < 42         to the left,  agree=0.796, adj=0.190, (0 split)
##       cons_tobacco       < 1.014088   to the left,  agree=0.790, adj=0.167, (0 split)
##       durable_investment < 100.6285   to the right, agree=0.784, adj=0.143, (0 split)
## 
## Node number 18: 29 observations,    complexity param=0.0186722
##   predicted class=0  expected loss=0.3448276  P(node) =0.03008299
##     class counts:    19    10
##    probabilities: 0.655 0.345 
##   left son=36 (18 obs) right son=37 (11 obs)
##   Primary splits:
##       fs_meat       < 1.5        to the right, improve=11.285270, (0 missing)
##       cons_social   < 2.602498   to the right, improve= 6.436782, (0 missing)
##       cons_other    < 20.54772   to the right, improve= 5.603448, (0 missing)
##       ed_expenses   < 20.65983   to the left,  improve= 4.214559, (0 missing)
##       asset_durable < 251.7616   to the right, improve= 3.629764, (0 missing)
##   Surrogate splits:
##       cons_social        < 2.602498   to the right, agree=0.862, adj=0.636, (0 split)
##       cons_nondurable    < 84.25133   to the right, agree=0.759, adj=0.364, (0 split)
##       asset_durable      < 153.0269   to the right, agree=0.759, adj=0.364, (0 split)
##       cons_other         < 20.54772   to the right, agree=0.759, adj=0.364, (0 split)
##       med_sickdays_hhave < 3.925      to the left,  agree=0.759, adj=0.364, (0 split)
## 
## Node number 19: 44 observations
##   predicted class=1  expected loss=0.1818182  P(node) =0.04564315
##     class counts:     8    36
##    probabilities: 0.182 0.818 
## 
## Node number 20: 10 observations
##   predicted class=0  expected loss=0  P(node) =0.01037344
##     class counts:    10     0
##    probabilities: 1.000 0.000 
## 
## Node number 21: 302 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4735099  P(node) =0.313278
##     class counts:   143   159
##    probabilities: 0.474 0.526 
##   left son=42 (244 obs) right son=43 (58 obs)
##   Primary splits:
##       marital_status    < 0.5        to the right, improve=4.672824, (0 missing)
##       age               < 22.5       to the left,  improve=2.490137, (0 missing)
##       children          < 6.5        to the left,  improve=2.361525, (0 missing)
##       fs_chwholed_often < 0.4395492  to the right, improve=1.611502, (0 missing)
##       years_of_edu      < 7.5        to the right, improve=1.274011, (0 missing)
##   Surrogate splits:
##       years_of_edu < 5.5        to the right, agree=0.821, adj=0.069, (0 split)
## 
## Node number 26: 10 observations
##   predicted class=0  expected loss=0  P(node) =0.01037344
##     class counts:    10     0
##    probabilities: 1.000 0.000 
## 
## Node number 27: 82 observations,    complexity param=0.01452282
##   predicted class=1  expected loss=0.304878  P(node) =0.08506224
##     class counts:    25    57
##    probabilities: 0.305 0.695 
##   left son=54 (13 obs) right son=55 (69 obs)
##   Primary splits:
##       ent_total_cost  < 0.8713918  to the left,  improve=6.662452, (0 missing)
##       cons_allfood    < 65.76883   to the left,  improve=5.572647, (0 missing)
##       cons_social     < 1.000961   to the right, improve=4.943337, (0 missing)
##       cons_nondurable < 73.02399   to the left,  improve=3.715162, (0 missing)
##       asset_durable   < 61.01856   to the right, improve=3.637501, (0 missing)
##   Surrogate splits:
##       nondurable_investment < 2.008594   to the left,  agree=0.927, adj=0.538, (0 split)
##       cons_nondurable       < 41.84797   to the left,  agree=0.890, adj=0.308, (0 split)
##       fs_meat               < 0.5        to the left,  agree=0.878, adj=0.231, (0 split)
## 
## Node number 34: 125 observations
##   predicted class=0  expected loss=0.264  P(node) =0.129668
##     class counts:    92    33
##    probabilities: 0.736 0.264 
## 
## Node number 35: 42 observations,    complexity param=0.01175657
##   predicted class=1  expected loss=0.4047619  P(node) =0.04356846
##     class counts:    17    25
##    probabilities: 0.405 0.595 
##   left son=70 (9 obs) right son=71 (33 obs)
##   Primary splits:
##       cons_alcohol   < 0.587257   to the right, improve=8.116883, (0 missing)
##       ent_total_cost < 10.20757   to the right, improve=5.418873, (0 missing)
##       cons_ed        < 2.335575   to the left,  improve=4.132326, (0 missing)
##       cons_tobacco   < 0.6978126  to the right, improve=4.004762, (0 missing)
##       ed_expenses    < 31.87059   to the left,  improve=3.569903, (0 missing)
##   Surrogate splits:
##       cons_tobacco       < 0.6978126  to the right, agree=0.833, adj=0.222, (0 split)
##       cons_ed            < 0.6673071  to the left,  agree=0.833, adj=0.222, (0 split)
##       med_sickdays_hhave < 4.45       to the right, agree=0.833, adj=0.222, (0 split)
##       ed_expenses        < 8.007685   to the left,  agree=0.833, adj=0.222, (0 split)
##       durable_investment < 84.29773   to the left,  agree=0.833, adj=0.222, (0 split)
## 
## Node number 36: 18 observations
##   predicted class=0  expected loss=0  P(node) =0.0186722
##     class counts:    18     0
##    probabilities: 1.000 0.000 
## 
## Node number 37: 11 observations
##   predicted class=1  expected loss=0.09090909  P(node) =0.01141079
##     class counts:     1    10
##    probabilities: 0.091 0.909 
## 
## Node number 42: 244 observations,    complexity param=0.01556017
##   predicted class=0  expected loss=0.4836066  P(node) =0.253112
##     class counts:   126   118
##    probabilities: 0.516 0.484 
##   left son=84 (212 obs) right son=85 (32 obs)
##   Primary splits:
##       children          < 5.5        to the left,  improve=3.062249, (0 missing)
##       age               < 24.5       to the left,  improve=2.760817, (0 missing)
##       fs_chwholed_often < 0.4395492  to the right, improve=2.287265, (0 missing)
##       years_of_edu      < 4.5        to the left,  improve=2.127327, (0 missing)
##       cons_med_children < 2.100913   to the right, improve=1.663826, (0 missing)
##   Surrogate splits:
##       cons_med_children  < 1.136585   to the right, agree=0.963, adj=0.719, (0 split)
##       household_size     < 7.5        to the left,  agree=0.947, adj=0.594, (0 split)
##       med_sickdays_hhave < 1.295687   to the right, agree=0.947, adj=0.594, (0 split)
##       fs_chwholed_often  < 0.5664414  to the left,  agree=0.881, adj=0.094, (0 split)
## 
## Node number 43: 58 observations
##   predicted class=1  expected loss=0.2931034  P(node) =0.06016598
##     class counts:    17    41
##    probabilities: 0.293 0.707 
## 
## Node number 54: 13 observations
##   predicted class=0  expected loss=0.2307692  P(node) =0.01348548
##     class counts:    10     3
##    probabilities: 0.769 0.231 
## 
## Node number 55: 69 observations
##   predicted class=1  expected loss=0.2173913  P(node) =0.07157676
##     class counts:    15    54
##    probabilities: 0.217 0.783 
## 
## Node number 70: 9 observations
##   predicted class=0  expected loss=0  P(node) =0.0093361
##     class counts:     9     0
##    probabilities: 1.000 0.000 
## 
## Node number 71: 33 observations
##   predicted class=1  expected loss=0.2424242  P(node) =0.03423237
##     class counts:     8    25
##    probabilities: 0.242 0.758 
## 
## Node number 84: 212 observations,    complexity param=0.01556017
##   predicted class=0  expected loss=0.4528302  P(node) =0.219917
##     class counts:   116    96
##    probabilities: 0.547 0.453 
##   left son=168 (56 obs) right son=169 (156 obs)
##   Primary splits:
##       household_size     < 5.5        to the right, improve=3.390853, (0 missing)
##       fs_chwholed_often  < 0.4288448  to the right, improve=3.227228, (0 missing)
##       years_of_edu       < 13.5       to the right, improve=2.968799, (0 missing)
##       med_sickdays_hhave < 1.525668   to the left,  improve=2.657614, (0 missing)
##       age                < 31.5       to the right, improve=2.086907, (0 missing)
##   Surrogate splits:
##       children           < 3.5        to the right, agree=0.934, adj=0.750, (0 split)
##       med_sickdays_hhave < 1.525668   to the left,  agree=0.906, adj=0.643, (0 split)
##       fs_chwholed_often  < 0.4288448  to the right, agree=0.877, adj=0.536, (0 split)
##       cons_med_children  < 1.238365   to the left,  agree=0.858, adj=0.464, (0 split)
##       age                < 38.5       to the right, agree=0.811, adj=0.286, (0 split)
## 
## Node number 85: 32 observations
##   predicted class=1  expected loss=0.3125  P(node) =0.03319502
##     class counts:    10    22
##    probabilities: 0.312 0.688 
## 
## Node number 168: 56 observations
##   predicted class=0  expected loss=0.3035714  P(node) =0.05809129
##     class counts:    39    17
##    probabilities: 0.696 0.304 
## 
## Node number 169: 156 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.4935897  P(node) =0.1618257
##     class counts:    77    79
##    probabilities: 0.494 0.506 
##   left son=338 (69 obs) right son=339 (87 obs)
##   Primary splits:
##       age               < 24.5       to the left,  improve=4.1560950, (0 missing)
##       children          < 1.5        to the left,  improve=3.0010680, (0 missing)
##       cons_med_children < 2.100913   to the right, improve=3.0010680, (0 missing)
##       fs_chwholed_often < 0.526824   to the right, improve=3.0010680, (0 missing)
##       years_of_edu      < 11.5       to the left,  improve=0.8312297, (0 missing)
##   Surrogate splits:
##       household_size     < 4.5        to the left,  agree=0.628, adj=0.159, (0 split)
##       med_sickdays_hhave < 1.6844     to the right, agree=0.628, adj=0.159, (0 split)
##       years_of_edu       < 8.5        to the left,  agree=0.622, adj=0.145, (0 split)
##       children           < 3.5        to the right, agree=0.583, adj=0.058, (0 split)
##       cons_med_children  < 1.238365   to the left,  agree=0.583, adj=0.058, (0 split)
## 
## Node number 338: 69 observations,    complexity param=0.01244813
##   predicted class=0  expected loss=0.3768116  P(node) =0.07157676
##     class counts:    43    26
##    probabilities: 0.623 0.377 
##   left son=676 (55 obs) right son=677 (14 obs)
##   Primary splits:
##       age                < 17.5       to the right, improve=4.0006020, (0 missing)
##       years_of_edu       < 8.5        to the right, improve=2.4492750, (0 missing)
##       med_sickdays_hhave < 1.6844     to the left,  improve=1.7536230, (0 missing)
##       household_size     < 4.5        to the right, improve=1.7536230, (0 missing)
##       fs_chwholed_often  < 0.4288448  to the left,  improve=0.7443551, (0 missing)
##   Surrogate splits:
##       children           < 3.5        to the left,  agree=0.855, adj=0.286, (0 split)
##       cons_med_children  < 1.238365   to the right, agree=0.855, adj=0.286, (0 split)
##       household_size     < 3.5        to the right, agree=0.826, adj=0.143, (0 split)
##       med_sickdays_hhave < 2.125778   to the left,  agree=0.826, adj=0.143, (0 split)
## 
## Node number 339: 87 observations,    complexity param=0.01556017
##   predicted class=1  expected loss=0.3908046  P(node) =0.09024896
##     class counts:    34    53
##    probabilities: 0.391 0.609 
##   left son=678 (9 obs) right son=679 (78 obs)
##   Primary splits:
##       fs_chwholed_often < 0.4288448  to the right, improve=7.450928, (0 missing)
##       children          < 1.5        to the left,  improve=5.650287, (0 missing)
##       cons_med_children < 2.100913   to the right, improve=5.650287, (0 missing)
##       age               < 32         to the right, improve=3.321839, (0 missing)
##       years_of_edu      < 12.5       to the right, improve=1.088692, (0 missing)
##   Surrogate splits:
##       children          < 1.5        to the left,  agree=0.977, adj=0.778, (0 split)
##       cons_med_children < 2.100913   to the right, agree=0.977, adj=0.778, (0 split)
## 
## Node number 676: 55 observations
##   predicted class=0  expected loss=0.2909091  P(node) =0.05705394
##     class counts:    39    16
##    probabilities: 0.709 0.291 
## 
## Node number 677: 14 observations
##   predicted class=1  expected loss=0.2857143  P(node) =0.01452282
##     class counts:     4    10
##    probabilities: 0.286 0.714 
## 
## Node number 678: 9 observations
##   predicted class=0  expected loss=0  P(node) =0.0093361
##     class counts:     9     0
##    probabilities: 1.000 0.000 
## 
## Node number 679: 78 observations
##   predicted class=1  expected loss=0.3205128  P(node) =0.08091286
##     class counts:    25    53
##    probabilities: 0.321 0.679 
## 
## n= 964 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 964 482 0 (0.50000000 0.50000000)  
##     2) fs_adwholed_often< 2 755 340 0 (0.54966887 0.45033113)  
##       4) cons_social>=0.08007685 357 115 0 (0.67787115 0.32212885)  
##         8) asset_phone< 63.26071 284  69 0 (0.75704225 0.24295775)  
##          16) ent_total_cost< 3.359891 117  11 0 (0.90598291 0.09401709) *
##          17) ent_total_cost>=3.359891 167  58 0 (0.65269461 0.34730539)  
##            34) years_of_edu>=7.5 125  33 0 (0.73600000 0.26400000) *
##            35) years_of_edu< 7.5 42  17 1 (0.40476190 0.59523810)  
##              70) cons_alcohol>=0.587257 9   0 0 (1.00000000 0.00000000) *
##              71) cons_alcohol< 0.587257 33   8 1 (0.24242424 0.75757576) *
##         9) asset_phone>=63.26071 73  27 1 (0.36986301 0.63013699)  
##          18) fs_meat< 3.5 29  10 0 (0.65517241 0.34482759)  
##            36) fs_meat>=1.5 18   0 0 (1.00000000 0.00000000) *
##            37) fs_meat< 1.5 11   1 1 (0.09090909 0.90909091) *
##          19) fs_meat>=3.5 44   8 1 (0.18181818 0.81818182) *
##       5) cons_social< 0.08007685 398 173 1 (0.43467337 0.56532663)  
##        10) fs_chwholed_often>=0.112069 312 153 1 (0.49038462 0.50961538)  
##          20) household_size< 2.5 10   0 0 (1.00000000 0.00000000) *
##          21) household_size>=2.5 302 143 1 (0.47350993 0.52649007)  
##            42) marital_status>=0.5 244 118 0 (0.51639344 0.48360656)  
##              84) children< 5.5 212  96 0 (0.54716981 0.45283019)  
##               168) household_size>=5.5 56  17 0 (0.69642857 0.30357143) *
##               169) household_size< 5.5 156  77 1 (0.49358974 0.50641026)  
##                 338) age< 24.5 69  26 0 (0.62318841 0.37681159)  
##                   676) age>=17.5 55  16 0 (0.70909091 0.29090909) *
##                   677) age< 17.5 14   4 1 (0.28571429 0.71428571) *
##                 339) age>=24.5 87  34 1 (0.39080460 0.60919540)  
##                   678) fs_chwholed_often>=0.4288448 9   0 0 (1.00000000 0.00000000) *
##                   679) fs_chwholed_often< 0.4288448 78  25 1 (0.32051282 0.67948718) *
##              85) children>=5.5 32  10 1 (0.31250000 0.68750000) *
##            43) marital_status< 0.5 58  17 1 (0.29310345 0.70689655) *
##        11) fs_chwholed_often< 0.112069 86  20 1 (0.23255814 0.76744186) *
##     3) fs_adwholed_often>=2 209  67 1 (0.32057416 0.67942584)  
##       6) med_sickdays_hhave< 1.775 108  51 1 (0.47222222 0.52777778)  
##        12) cons_social>=4.170669 16   0 0 (1.00000000 0.00000000) *
##        13) cons_social< 4.170669 92  35 1 (0.38043478 0.61956522)  
##          26) marital_status< 0.5 10   0 0 (1.00000000 0.00000000) *
##          27) marital_status>=0.5 82  25 1 (0.30487805 0.69512195)  
##            54) ent_total_cost< 0.8713918 13   3 0 (0.76923077 0.23076923) *
##            55) ent_total_cost>=0.8713918 69  15 1 (0.21739130 0.78260870) *
##       7) med_sickdays_hhave>=1.775 101  16 1 (0.15841584 0.84158416) *
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 138  52
##          1  68 154
##                                           
##                Accuracy : 0.7087          
##                  95% CI : (0.6623, 0.7522)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4175          
##                                           
##  Mcnemar's Test P-Value : 0.1709          
##                                           
##             Sensitivity : 0.6699          
##             Specificity : 0.7476          
##          Pos Pred Value : 0.7263          
##          Neg Pred Value : 0.6937          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3350          
##    Detection Prevalence : 0.4612          
##       Balanced Accuracy : 0.7087          
##                                           
##        'Positive' Class : 0               
## 

5.2.2 Summary of Decision Tree model on balanced dataset (Upsampling)

Summary of model performance: The model achieves an accuracy of 70.87%, indicating that it distinguishes reasonably well between the two classes (depressed and non-depressed). The Kappa statistic of 0.4175 reflects moderate agreement beyond chance, so the model shows some validity on the balanced dataset.

Kappa: 0.4175. Values between 0.4 and 0.6 are conventionally read as moderate agreement, so the model has moderate predictive consistency.
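The reported Kappa follows directly from the confusion matrix above (cells 138, 52, 68, 154). A quick arithmetic check, written in Python for illustration (the helper name `cohens_kappa` is ours, not part of the project code):

```python
def cohens_kappa(cm):
    """Cohen's kappa for a 2x2 confusion matrix [[a, b], [c, d]]
    with rows = predicted class and columns = reference class."""
    (a, b), (c, d) = cm
    n = a + b + c + d
    p_observed = (a + d) / n  # observed agreement (= accuracy)
    # expected agreement from the row and column marginals
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

kappa = cohens_kappa([[138, 52], [68, 154]])
print(round(kappa, 4))  # 0.4175
```

Because the upsampled test set is perfectly balanced, the expected agreement is exactly 0.5, and kappa simplifies to 2 × (accuracy − 0.5).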

McNemar's Test P-Value: 0.1709. Since this exceeds 0.05, the asymmetry between the two kinds of misclassification is not statistically significant; the model's errors are fairly balanced across the two classes.
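McNemar's test depends only on the off-diagonal (discordant) cells, here 52 and 68. A sketch of the continuity-corrected version in Python (for one degree of freedom the chi-square tail reduces to a standard-library `erfc` call; the function name is ours):

```python
from math import erfc, sqrt

def mcnemar_p(b, c):
    """Continuity-corrected McNemar test on the two discordant counts.
    For 1 degree of freedom, the chi-square tail is erfc(sqrt(stat / 2))."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return erfc(sqrt(stat / 2))

print(round(mcnemar_p(52, 68), 4))  # 0.1709
```

The same calculation on the down-sampled model's discordant cells (15 and 17) reproduces the 0.8597 reported later in this section.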

Sensitivity: 66.99%. The model correctly identifies approximately 67% of non-depressed instances (class 0, the positive class in this output).

Specificity: 74.76%. The model correctly identifies approximately 75% of depressed instances (class 1).

Balanced Accuracy: 70.87%. As the average of sensitivity and specificity, this indicates reasonably balanced performance across the two classes.

Variable Importance and Splits of the Model

Main splits of the decision tree: The first split is on “fs_adwholed_often” (how often an adult in the household went a whole day without food, a food-insecurity measure), which suggests that household food security is an important factor in depression status. Further down the tree, health and social variables such as “cons_social” (social consumption) and “med_sickdays_hhave” (household sick days) also serve as decision nodes.
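rpart chooses each of these splits by scanning candidate thresholds on every predictor and keeping the one with the largest impurity improvement. A minimal sketch of that mechanism using Gini impurity on toy data (this mirrors the idea, not rpart's exact priors and weighting; all names here are ours):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return the threshold on x with the largest reduction
    in weighted Gini impurity, plus that reduction (the 'improve')."""
    order = sorted(set(x))
    best = (None, -1.0)
    for lo, hi in zip(order, order[1:]):
        t = (lo + hi) / 2  # midpoint between adjacent observed values
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        gain = gini(y) - (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if gain > best[1]:
            best = (t, gain)
    return best

# toy example: low x mostly class 0, high x mostly class 1
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, gain = best_split(x, y)
print(threshold)  # 4.5
```

This is why the printed splits use cutpoints such as `household_size < 2.5`: the threshold always lands at the midpoint between two adjacent observed values.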

5.3.1 Logistic model on balanced dataset (down sampling)

# Load necessary libraries
library(readxl)
library(caret)
library(dplyr)

# Function to read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Function to remove outliers based on Z-score
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ]
}

# Function for downsampling imbalanced data
downSampling <- function(dataset) {
  set.seed(111)
  downsampled_data <- downSample(x = dataset[, -ncol(dataset)], y = dataset$depressed)
  names(downsampled_data)[names(downsampled_data) == "Class"] <- "depressed"
  return(downsampled_data)
}

# Function for logistic regression analysis on balanced dataset
perform_logistic_regression <- function(balanced_dataset) {
  # Setting up data for training and testing
  set.seed(123)
  training_indices <- createDataPartition(balanced_dataset$depressed, p = 0.7, list = FALSE)
  train_data <- balanced_dataset[training_indices, ]
  test_data <- balanced_dataset[-training_indices, ]

  # Training the logistic regression model
  logistic_model <- glm(depressed ~ ., data = train_data, family = binomial())
  print(summary(logistic_model))  # inside a function, summary() output must be printed explicitly
  
  # Predicting on the test data
  predictions <- predict(logistic_model, newdata = test_data, type = "response")
  predicted_classes <- factor(ifelse(predictions > 0.5, 1, 0), levels = c(0, 1))
  
  # Evaluating the model
  confusion_matrix <- confusionMatrix(predicted_classes, test_data$depressed)
  print(confusion_matrix)
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
balanced_data <- downSampling(data)
perform_logistic_regression(balanced_data)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 20 15
##          1 17 22
##                                           
##                Accuracy : 0.5676          
##                  95% CI : (0.4472, 0.6823)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.1477          
##                                           
##                   Kappa : 0.1351          
##                                           
##  Mcnemar's Test P-Value : 0.8597          
##                                           
##             Sensitivity : 0.5405          
##             Specificity : 0.5946          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.5641          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2703          
##    Detection Prevalence : 0.4730          
##       Balanced Accuracy : 0.5676          
##                                           
##        'Positive' Class : 0               
## 

5.3.1 Summary of Logistic model on balanced dataset (down sampling)

Summarizing the model performance

This logistic regression model performs poorly on the balanced dataset, with an accuracy of only 56.76%, which indicates that it is not very effective at distinguishing depressed from non-depressed states. The Kappa statistic of 0.1351 indicates only slight agreement beyond chance.

Accuracy: 56.76%. The overall accuracy of the model is low, indicating its limited discriminatory power.

95% CI (Confidence Interval): (44.72%, 68.23%). The interval is wide and contains the 50% no-information rate, so the estimated accuracy is not significantly better than chance.
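caret reports the exact (Clopper-Pearson) binomial interval for 42 correct predictions out of 74. A stdlib-only sketch that recovers it by bisection on the binomial tail probabilities (function names are ours, for illustration):

```python
from math import comb

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact binomial confidence interval, found by bisection."""
    def bisect(f):  # root of an increasing function f on (0, 1)
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > 0:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2
    # lower bound: p with P(X >= k | p) = alpha/2 (tail increases in p)
    lower = 0.0 if k == 0 else bisect(lambda p: binom_tail_ge(k, n, p) - alpha / 2)
    # upper bound: p with P(X <= k | p) = alpha/2 (that tail decreases in p)
    upper = 1.0 if k == n else bisect(lambda p: alpha / 2 - (1 - binom_tail_ge(k + 1, n, p)))
    return lower, upper

lo, hi = clopper_pearson(42, 74)
print(round(lo, 4), round(hi, 4))  # 0.4472 0.6823
```

The width of roughly 24 percentage points reflects the small test set (74 observations) left after down-sampling.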

No Information Rate (NIR): 50%. This is the accuracy achievable by always predicting the more common class; on this balanced dataset that baseline is 50%.

Kappa: 0.1351. A value this low indicates only slight agreement beyond chance, i.e., weak predictive power.

McNemar's Test P-Value: 0.8597. The asymmetry between the two types of misclassification is not significant; the model's errors are evenly split between the two classes.

Sensitivity and Specificity:54.05% and 59.46%. These two indicators show that the model is weak in recognizing both positive and negative categories.

Positive Predictive Value, PPV and Negative Predictive Value, NPV: PPV is 57.14% and NPV is 56.41%, which indicates that the model is average in predicting correctness.

Balanced Accuracy: 56.76%, which indicates that the model is average in positive and negative class prediction.
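All of these figures follow directly from the 2×2 confusion matrix. As a sketch, the counts below are reconstructed from the reported rates (74 test cases, 'Positive' class = 0) and reproduce each headline metric by hand:

```r
# Reconstructing caret's confusionMatrix() metrics from the 2x2 counts.
# Counts implied by the reported rates (n = 74, 'Positive' class = 0):
#            actual 0   actual 1
# pred 0        20         15
# pred 1        17         22
tp <- 20; fp <- 15
fn <- 17; tn <- 22
n  <- tp + fp + fn + tn

accuracy     <- (tp + tn) / n                     # 42/74 = 0.5676
sensitivity  <- tp / (tp + fn)                    # 20/37 = 0.5405
specificity  <- tn / (tn + fp)                    # 22/37 = 0.5946
ppv          <- tp / (tp + fp)                    # 20/35 = 0.5714
npv          <- tn / (tn + fn)                    # 22/39 = 0.5641
balanced_acc <- (sensitivity + specificity) / 2   # 0.5676

# Cohen's kappa: observed agreement vs. chance agreement from the margins
p_exp <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2   # 0.5
kappa <- (accuracy - p_exp) / (1 - p_exp)                        # 0.1351
```

Because the down-sampled test set is exactly balanced, the chance agreement p_exp works out to 0.5, which is why Kappa is simply twice the excess of accuracy over 50%.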

5.3.2 Decision Tree model on balanced dataset (down sampling)

# Load necessary libraries
library(readxl)
library(caret)
library(dplyr)
library(rpart)
library(rattle)  # For fancyRpartPlot

# Read and prepare the dataset
read_data <- function(file_path) {
  data <- read_excel(file_path)
  print(paste("Number of rows:", nrow(data)))
  print(paste("Number of columns:", ncol(data)))
  print(paste("Number of missing values:", sum(is.na(data))))
  return(data)
}

# Remove rows with any column |z-score| >= 3 (assumes all columns are numeric)
remove_outliers <- function(dataset) {
  means <- apply(dataset, 2, mean, na.rm = TRUE)
  sds <- apply(dataset, 2, sd, na.rm = TRUE)
  z_scores <- abs(scale(dataset, center = means, scale = sds))
  return(dataset[apply(z_scores, 1, max, na.rm = TRUE) < 3, ])
}

# Downsample imbalanced data (assumes 'depressed' is the last column)
downSampling <- function(dataset) {
  set.seed(111)
  downsampled_data <- downSample(x = dataset[, -ncol(dataset)], y = dataset$depressed)
  names(downsampled_data)[names(downsampled_data) == "Class"] <- "depressed"
  return(downsampled_data)
}

# Decision tree analysis on balanced dataset
perform_decision_tree <- function(balanced_dataset) {
  set.seed(123)
  training_indices <- createDataPartition(balanced_dataset$depressed, p = 0.7, list = FALSE)
  train_data <- balanced_dataset[training_indices, ]
  test_data <- balanced_dataset[-training_indices, ]

  tree_model <- rpart(depressed ~ ., data = train_data, method = "class")
  fancyRpartPlot(tree_model, main="Balanced Decision Tree")
  print(summary(tree_model))

  predicted_values <- predict(tree_model, newdata = test_data, type = "class")
  predicted_classes <- factor(predicted_values, levels = c(0, 1))
  actual_values <- factor(test_data$depressed, levels = c(0, 1))
  confusion_matrix <- confusionMatrix(predicted_classes, actual_values)
  print(confusion_matrix)
}

# Main execution block
file_path <- "/Users/zijiehe/Desktop/STAT515FINAL/dep.xlsx"
data <- read_data(file_path)
## [1] "Number of rows: 1147"
## [1] "Number of columns: 35"
## [1] "Number of missing values: 0"
data <- remove_outliers(data)
data$depressed <- as.factor(data$depressed)
balanced_data <- downSampling(data)
perform_decision_tree(balanced_data)

## Call:
## rpart(formula = depressed ~ ., data = train_data, method = "class")
##   n= 176 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.23863636      0 1.0000000 1.2045455 0.07378413
## 2 0.05113636      1 0.7613636 0.7954545 0.07378413
## 3 0.03409091      4 0.6022727 0.8863636 0.07488957
## 4 0.02272727      6 0.5340909 0.9204545 0.07513898
## 5 0.01136364      7 0.5113636 0.9090909 0.07506571
## 6 0.01000000      8 0.5000000 0.9090909 0.07506571
## 
## Variable importance
##          years_of_edu    med_sickdays_hhave           ed_expenses 
##                    14                    13                     9 
##               cons_ed           cons_social nondurable_investment 
##                     9                     7                     6 
##                   age        household_size              children 
##                     5                     4                     4 
##       cons_nondurable          cons_allfood           hh_children 
##                     4                     4                     4 
##    durable_investment        ent_total_cost     cons_med_children 
##                     3                     3                     3 
##            cons_other       ed_schoolattend     fs_chwholed_often 
##                     3                     3                     2 
##     fs_adwholed_often          cons_alcohol 
##                     1                     1 
## 
## Node number 1: 176 observations,    complexity param=0.2386364
##   predicted class=0  expected loss=0.5  P(node) =1
##     class counts:    88    88
##    probabilities: 0.500 0.500 
##   left son=2 (141 obs) right son=3 (35 obs)
##   Primary splits:
##       years_of_edu       < 6.5        to the right, improve=7.863830, (0 missing)
##       med_sickdays_hhave < 0.08333333 to the left,  improve=6.731935, (0 missing)
##       ed_expenses        < 51.20914   to the left,  improve=3.927273, (0 missing)
##       cons_allfood       < 186.4439   to the left,  improve=3.880071, (0 missing)
##       cons_social        < 3.269805   to the right, improve=3.705783, (0 missing)
##   Surrogate splits:
##       age                < 46.5       to the left,  agree=0.847, adj=0.229, (0 split)
##       household_size     < 1.5        to the right, agree=0.841, adj=0.200, (0 split)
##       children           < 0.5        to the right, agree=0.830, adj=0.143, (0 split)
##       med_sickdays_hhave < 6.955645   to the left,  agree=0.830, adj=0.143, (0 split)
##       fs_adwholed_often  < 5.25       to the left,  agree=0.812, adj=0.057, (0 split)
## 
## Node number 2: 141 observations,    complexity param=0.05113636
##   predicted class=0  expected loss=0.4255319  P(node) =0.8011364
##     class counts:    81    60
##    probabilities: 0.574 0.426 
##   left son=4 (28 obs) right son=5 (113 obs)
##   Primary splits:
##       med_sickdays_hhave     < 0.08333333 to the left,  improve=5.583452, (0 missing)
##       ed_expenses            < 45.48365   to the left,  improve=5.579527, (0 missing)
##       cons_ed                < 3.790304   to the left,  improve=4.362527, (0 missing)
##       cons_social            < 0.1648248  to the right, improve=3.451186, (0 missing)
##       asset_land_owned_total < 1.25       to the right, improve=2.760732, (0 missing)
##   Surrogate splits:
##       cons_alcohol      < 4.186875   to the right, agree=0.816, adj=0.071, (0 split)
##       cons_med_children < 7.206916   to the right, agree=0.809, adj=0.036, (0 split)
## 
## Node number 3: 35 observations
##   predicted class=1  expected loss=0.2  P(node) =0.1988636
##     class counts:     7    28
##    probabilities: 0.200 0.800 
## 
## Node number 4: 28 observations
##   predicted class=0  expected loss=0.1428571  P(node) =0.1590909
##     class counts:    24     4
##    probabilities: 0.857 0.143 
## 
## Node number 5: 113 observations,    complexity param=0.05113636
##   predicted class=0  expected loss=0.4955752  P(node) =0.6420455
##     class counts:    57    56
##    probabilities: 0.504 0.496 
##   left son=10 (102 obs) right son=11 (11 obs)
##   Primary splits:
##       ed_expenses     < 45.48365   to the left,  improve=4.167589, (0 missing)
##       ed_schoolattend < 0.7321429  to the left,  improve=3.073525, (0 missing)
##       cons_allfood    < 150.3553   to the left,  improve=3.063232, (0 missing)
##       cons_ed         < 3.790304   to the left,  improve=3.063232, (0 missing)
##       cons_nondurable < 155.5233   to the left,  improve=2.758684, (0 missing)
##   Surrogate splits:
##       cons_ed     < 3.790304   to the left,  agree=0.991, adj=0.909, (0 split)
##       cons_social < 10.10303   to the left,  agree=0.929, adj=0.273, (0 split)
## 
## Node number 10: 102 observations,    complexity param=0.05113636
##   predicted class=0  expected loss=0.4509804  P(node) =0.5795455
##     class counts:    56    46
##    probabilities: 0.549 0.451 
##   left son=20 (33 obs) right son=21 (69 obs)
##   Primary splits:
##       cons_social       < 0.7607301  to the right, improve=3.100054, (0 missing)
##       cons_med_children < 2.80633    to the right, improve=1.844910, (0 missing)
##       asset_durable     < 157.351    to the right, improve=1.593137, (0 missing)
##       cons_nondurable   < 55.43596   to the right, improve=1.551191, (0 missing)
##       cons_allfood      < 27.7962    to the right, improve=1.551191, (0 missing)
##   Surrogate splits:
##       cons_nondurable       < 55.43596   to the right, agree=0.902, adj=0.697, (0 split)
##       cons_allfood          < 27.7962    to the right, agree=0.902, adj=0.697, (0 split)
##       ent_total_cost        < 0.03336535 to the right, agree=0.892, adj=0.667, (0 split)
##       durable_investment    < 42.50397   to the right, agree=0.892, adj=0.667, (0 split)
##       nondurable_investment < 0.7240282  to the right, agree=0.892, adj=0.667, (0 split)
## 
## Node number 11: 11 observations
##   predicted class=1  expected loss=0.09090909  P(node) =0.0625
##     class counts:     1    10
##    probabilities: 0.091 0.909 
## 
## Node number 20: 33 observations
##   predicted class=0  expected loss=0.2727273  P(node) =0.1875
##     class counts:    24     9
##    probabilities: 0.727 0.273 
## 
## Node number 21: 69 observations,    complexity param=0.03409091
##   predicted class=1  expected loss=0.4637681  P(node) =0.3920455
##     class counts:    32    37
##    probabilities: 0.464 0.536 
##   left son=42 (61 obs) right son=43 (8 obs)
##   Primary splits:
##       hh_children       < 2.5        to the left,  improve=2.077037, (0 missing)
##       cons_ed           < 0.3002882  to the left,  improve=1.627315, (0 missing)
##       fs_chwholed_often < 0.3434622  to the right, improve=1.620659, (0 missing)
##       cons_med_children < 0.6880889  to the right, improve=1.489211, (0 missing)
##       years_of_edu      < 10.5       to the right, improve=1.196034, (0 missing)
##   Surrogate splits:
##       cons_ed               < 0.3002882  to the left,  agree=0.971, adj=0.750, (0 split)
##       cons_other            < 20.05925   to the left,  agree=0.971, adj=0.750, (0 split)
##       ed_schoolattend       < 0.25       to the left,  agree=0.971, adj=0.750, (0 split)
##       ed_expenses           < 3.603458   to the left,  agree=0.957, adj=0.625, (0 split)
##       nondurable_investment < 0.854153   to the left,  agree=0.957, adj=0.625, (0 split)
## 
## Node number 42: 61 observations,    complexity param=0.03409091
##   predicted class=0  expected loss=0.4918033  P(node) =0.3465909
##     class counts:    31    30
##    probabilities: 0.508 0.492 
##   left son=84 (40 obs) right son=85 (21 obs)
##   Primary splits:
##       age                < 26.5       to the right, improve=1.0370410, (0 missing)
##       household_size     < 4.5        to the right, improve=1.0089660, (0 missing)
##       med_sickdays_hhave < 1.6844     to the left,  improve=0.9841780, (0 missing)
##       years_of_edu       < 10.5       to the right, improve=0.7503067, (0 missing)
##       marital_status     < 0.5        to the right, improve=0.5608942, (0 missing)
##   Surrogate splits:
##       household_size     < 4.5        to the right, agree=0.738, adj=0.238, (0 split)
##       children           < 2.5        to the right, agree=0.721, adj=0.190, (0 split)
##       fs_chwholed_often  < 0.2787356  to the right, agree=0.721, adj=0.190, (0 split)
##       fs_meat            < 1.5        to the right, agree=0.689, adj=0.095, (0 split)
##       med_sickdays_hhave < 2.018315   to the left,  agree=0.689, adj=0.095, (0 split)
## 
## Node number 43: 8 observations
##   predicted class=1  expected loss=0.125  P(node) =0.04545455
##     class counts:     1     7
##    probabilities: 0.125 0.875 
## 
## Node number 84: 40 observations,    complexity param=0.02272727
##   predicted class=0  expected loss=0.425  P(node) =0.2272727
##     class counts:    23    17
##    probabilities: 0.575 0.425 
##   left son=168 (20 obs) right son=169 (20 obs)
##   Primary splits:
##       cons_med_children < 1.249358   to the right, improve=1.2500000, (0 missing)
##       age               < 35.5       to the left,  improve=0.8632832, (0 missing)
##       children          < 3.5        to the left,  improve=0.8632832, (0 missing)
##       fs_chwholed_often < 0.3642956  to the left,  improve=0.2317043, (0 missing)
##       household_size    < 6.5        to the left,  improve=0.1928571, (0 missing)
##   Surrogate splits:
##       children           < 3.5        to the left,  agree=0.825, adj=0.65, (0 split)
##       med_sickdays_hhave < 1.525668   to the right, agree=0.800, adj=0.60, (0 split)
##       fs_chwholed_often  < 0.3642956  to the left,  agree=0.775, adj=0.55, (0 split)
##       household_size     < 5.5        to the left,  agree=0.725, adj=0.45, (0 split)
##       years_of_edu       < 8.5        to the left,  agree=0.625, adj=0.25, (0 split)
## 
## Node number 85: 21 observations
##   predicted class=1  expected loss=0.3809524  P(node) =0.1193182
##     class counts:     8    13
##    probabilities: 0.381 0.619 
## 
## Node number 168: 20 observations
##   predicted class=0  expected loss=0.3  P(node) =0.1136364
##     class counts:    14     6
##    probabilities: 0.700 0.300 
## 
## Node number 169: 20 observations,    complexity param=0.01136364
##   predicted class=1  expected loss=0.45  P(node) =0.1136364
##     class counts:     9    11
##    probabilities: 0.450 0.550 
##   left son=338 (7 obs) right son=339 (13 obs)
##   Primary splits:
##       age               < 42.5       to the right, improve=0.3175824, (0 missing)
##       children          < 5          to the left,  improve=0.1500000, (0 missing)
##       household_size    < 6.5        to the left,  improve=0.1500000, (0 missing)
##       years_of_edu      < 9.5        to the left,  improve=0.1000000, (0 missing)
##       cons_med_children < 1.136585   to the left,  improve=0.1000000, (0 missing)
##   Surrogate splits:
##       children          < 3          to the left,  agree=0.80, adj=0.429, (0 split)
##       cons_med_children < 0.1501441  to the left,  agree=0.80, adj=0.429, (0 split)
##       fs_chwholed_often < 0.1666667  to the left,  agree=0.80, adj=0.429, (0 split)
##       cons_nondurable   < 49.48167   to the right, agree=0.75, adj=0.286, (0 split)
##       asset_durable     < 69.1864    to the right, agree=0.75, adj=0.286, (0 split)
## 
## Node number 338: 7 observations
##   predicted class=0  expected loss=0.4285714  P(node) =0.03977273
##     class counts:     4     3
##    probabilities: 0.571 0.429 
## 
## Node number 339: 13 observations
##   predicted class=1  expected loss=0.3846154  P(node) =0.07386364
##     class counts:     5     8
##    probabilities: 0.385 0.615 
## 
## n= 176 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 176 88 0 (0.50000000 0.50000000)  
##     2) years_of_edu>=6.5 141 60 0 (0.57446809 0.42553191)  
##       4) med_sickdays_hhave< 0.08333333 28  4 0 (0.85714286 0.14285714) *
##       5) med_sickdays_hhave>=0.08333333 113 56 0 (0.50442478 0.49557522)  
##        10) ed_expenses< 45.48365 102 46 0 (0.54901961 0.45098039)  
##          20) cons_social>=0.7607301 33  9 0 (0.72727273 0.27272727) *
##          21) cons_social< 0.7607301 69 32 1 (0.46376812 0.53623188)  
##            42) hh_children< 2.5 61 30 0 (0.50819672 0.49180328)  
##              84) age>=26.5 40 17 0 (0.57500000 0.42500000)  
##               168) cons_med_children>=1.249358 20  6 0 (0.70000000 0.30000000) *
##               169) cons_med_children< 1.249358 20  9 1 (0.45000000 0.55000000)  
##                 338) age>=42.5 7  3 0 (0.57142857 0.42857143) *
##                 339) age< 42.5 13  5 1 (0.38461538 0.61538462) *
##              85) age< 26.5 21  8 1 (0.38095238 0.61904762) *
##            43) hh_children>=2.5 8  1 1 (0.12500000 0.87500000) *
##        11) ed_expenses>=45.48365 11  1 1 (0.09090909 0.90909091) *
##     3) years_of_edu< 6.5 35  7 1 (0.20000000 0.80000000) *
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 19 15
##          1 18 22
##                                           
##                Accuracy : 0.5541          
##                  95% CI : (0.4339, 0.6698)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.2080          
##                                           
##                   Kappa : 0.1081          
##                                           
##  Mcnemar's Test P-Value : 0.7277          
##                                           
##             Sensitivity : 0.5135          
##             Specificity : 0.5946          
##          Pos Pred Value : 0.5588          
##          Neg Pred Value : 0.5500          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2568          
##    Detection Prevalence : 0.4595          
##       Balanced Accuracy : 0.5541          
##                                           
##        'Positive' Class : 0               
## 

5.3.2 Summary of Decision Tree model on balanced dataset (down sampling)

Accuracy: 55.41%, slightly below the logistic model on the same down-sampled data.

Kappa: 0.1081, again indicating only slight agreement beyond chance.

Root node: the first split is on years_of_edu, suggesting that years of education is the single most important factor for distinguishing the two categories (depressed or not).

First-level split: further partitioning is based on med_sickdays_hhave, showing that household health status is also an important factor in depression status.

Deeper nodes: variables such as ed_expenses and cons_social are used further down the tree, showing that the model combines several different features to refine its decisions.
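The CP table above also shows that the cross-validated error (xerror) is lowest after a single split, hinting that the deeper nodes may be overfitting. A standard remedy is to prune at the cp with the lowest xerror; a minimal sketch, shown on rpart's built-in kyphosis data since the project dataset is local:

```r
library(rpart)

# Fit a classification tree, then prune it at the complexity parameter (cp)
# with the lowest cross-validated error in the cptable -- the same procedure
# one would apply to the depression tree's CP table above.
set.seed(123)
fit <- rpart(Kyphosis ~ ., data = kyphosis, method = "class")

best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

printcp(pruned)   # pruned tree's CP table
```

For the depression tree, pruning at the reported minimum (xerror = 0.795 at nsplit = 1) would keep only the years_of_edu split.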

6.Conclusions

# kable() and kable_styling() require knitr and kableExtra
library(knitr)
library(kableExtra)

results <- data.frame(
  Condition = c("Imbalanced", "Imbalanced", "Imbalanced", 
                "Balanced - Upsampling", "Balanced - Upsampling", 
                "Balanced - Downsampling", "Balanced - Downsampling"),
  Model = c("Logistic Regression", "Weighted Logistic Regression", "Decision Tree",
            "Logistic Regression", "Decision Tree", 
            "Logistic Regression", "Decision Tree"),
  Accuracy = c(0.8436, 0.7469, 0.823, 0.6359, 0.7087, 0.5676, 0.5541),
  Kappa = c(0.0278, 0.2413, -0.0102, 0.2718, 0.4175, 0.1351, 0.1081),
  Sensitivity = c(0.99029, 0.7883, 0.96602, 0.6893, 0.6699, 0.5405, 0.5135),
  Specificity = c(0.02703, 0.52, 0.02703, 0.5825, 0.7476, 0.5946, 0.5946)
)

kable(results, caption = "Performance Metrics for Various Models",
      col.names = c("Condition", "Model", "Accuracy", "Kappa", "Sensitivity", "Specificity"),
      format = "html", align = c('l', 'l', 'c', 'c', 'c', 'c')) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE, position = "left") %>%
  row_spec(5, background = "#90EE90")  # Highlight the fifth row with a light green background
Performance Metrics for Various Models

Condition                 Model                          Accuracy   Kappa     Sensitivity   Specificity
Imbalanced                Logistic Regression            0.8436     0.0278    0.99029       0.02703
Imbalanced                Weighted Logistic Regression   0.7469     0.2413    0.78830       0.52000
Imbalanced                Decision Tree                  0.8230    -0.0102    0.96602       0.02703
Balanced - Upsampling     Logistic Regression            0.6359     0.2718    0.68930       0.58250
Balanced - Upsampling     Decision Tree                  0.7087     0.4175    0.66990       0.74760
Balanced - Downsampling   Logistic Regression            0.5676     0.1351    0.54050       0.59460
Balanced - Downsampling   Decision Tree                  0.5541     0.1081    0.51350       0.59460
data <- read.csv(textConnection("
Condition,Model,Accuracy,Kappa,Sensitivity,Specificity
Imbalanced,Logistic Regression,0.8436,0.0278,0.99029,0.02703
Imbalanced,Weighted Logistic Regression,0.7469,0.2413,0.78830,0.52000
Imbalanced,Decision Tree,0.8230,-0.0102,0.96602,0.02703
Balanced - Upsampling,Logistic Regression,0.6359,0.2718,0.68930,0.58250
Balanced - Upsampling,Decision Tree,0.7087,0.4175,0.66990,0.74760
Balanced - Downsampling,Logistic Regression,0.5676,0.1351,0.54050,0.59460
Balanced - Downsampling,Decision Tree,0.5541,0.1081,0.51350,0.59460
"))

data_long <- data %>%
  pivot_longer(cols = c("Accuracy", "Kappa", "Sensitivity", "Specificity"),
               names_to = "Metric", values_to = "Value")

p <- ggplot(data_long, aes(x = Condition, y = Value, color = Model, group = Model)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ Metric, scales = "free_y") +
  labs(title = "Model Performance Comparison",
       x = "Condition",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5),
        plot.margin = margin(t = 10, r = 10, b = 20, l = 10, unit = "pt"))

# Make the plot interactive (ggplotly requires the plotly package)
library(plotly)
gg_plotly <- ggplotly(p)
gg_plotly

Across a variety of modeling scenarios, we observed significant differences in performance between models trained on the imbalanced dataset and those trained on datasets balanced through upsampling or downsampling. On the imbalanced data, Logistic Regression offers high accuracy and sensitivity, but at the expense of exceptionally low specificity, indicating that it has essentially overfit the majority class. Because of this skew, the model predicts the majority class well but fails to recognize instances of the minority class. Weighted Logistic Regression, which shifts the model's focus toward the minority class, improves specificity substantially, but this adjustment reduces overall accuracy and sensitivity, demonstrating the trade-offs involved in addressing class imbalance.

The Decision Tree model stands out among the Balanced - Upsampling models, chiefly because of its superior Kappa score of 0.4175, which indicates that its predictions agree with the actual labels well beyond what would be expected by chance. It also achieves the highest specificity of all models, 0.7476, demonstrating its efficiency at identifying true negatives, which is crucial when analyzing a balanced dataset with no dominant class. The Decision Tree's performance in this scenario shows that it can handle the complexity of a balanced dataset, leveraging its ability to capture non-linear relationships and intricate feature interactions.

Conversely, both models degrade in the Balanced - Downsampling scenario. Here, downsampling likely reduced the variability and richness of the training data, which are crucial for learning, underscoring the downsides of downsampling as a balancing method.

Based on these insights, we particularly recommend the Decision Tree model trained on the upsampled balanced dataset. The recommendation rests on the model's robust Kappa score, which reflects agreement with the true labels substantially beyond chance on a balanced dataset. An advantage of upsampling is that it preserves the entire dataset, enhancing the model's ability to learn from all available nuances, unlike downsampling, which may discard crucial information. Decision trees also inherently select significant features and optimize data splits, making them well suited to situations where class representation requires careful consideration.
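The contrast between the two balancing methods can be sketched with caret's upSample(), which replicates minority-class rows until the classes are balanced rather than discarding majority-class rows as downSample() does. A minimal illustration on toy data (the project dataset is local):

```r
library(caret)

# Toy imbalanced data: 25 non-depressed (0) vs. 5 depressed (1)
set.seed(111)
toy <- data.frame(x = rnorm(30),
                  depressed = factor(c(rep(0, 25), rep(1, 5))))

# upSample() resamples the minority class with replacement until balanced,
# so no majority-class observations are lost
up <- upSample(x = toy[, "x", drop = FALSE],
               y = toy$depressed,
               yname = "depressed")

table(up$depressed)   # both classes now have 25 rows (50 rows total)
```

Because every original row survives up-sampling, the model keeps access to the full variability of the data, which is exactly the property the recommendation above relies on.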